Foreword


This book distills the multivariate thinking of Robert Tryon and his students
over a period of more than 30 years. Tryon's initial statement of
cluster analysis was presented in his 1939 monograph entitled “Cluster
Analysis.” At that time, all computations had to be done by hand; Tryon 
was later to speak of his misspent youth, because too much of his time had 
been spent with a desk calculator. In the 1950s the practice of cluster analy- 
sis was restated in computer terms to enable the investigator to escape 
from hand calculations. Tryon and Bailey therefore planned this book to be 
the definitive account of postcomputer cluster analysis. The manuscript 
was almost finished when Tryon died suddenly in 1967. Bailey has perforce 
had to accept sole responsibility for the final revisions. 

Since the book is appearing posthumously, the reader may find it 
helpful to know about Robert Tryon's psychological career and his intellectual
objectives in formulating cluster analysis. His academic record was a
simple one. He received both bachelor's and doctor's degrees from the 
University of California, he was then invited to join the faculty of the uni- 
versity, and he stayed at Berkeley in the Department of Psychology for the 
rest of his life. His only lengthy absence was his war service for the Office of
Strategic Services from 1941 to 1945 in Washington, D.C.




His intellectual history was more complicated. His interests were 
broad, but they were linked in a unique way by his personal philosophy and 
his intellectual objectives. His first research contributions were in experi- 
mental psychology, but he was already an experimenter with a difference. 
Most experimentalists of the period focused on the construction of general 
theories of perception, learning, and so on, and viewed individual differ- 
ences in behavior as a nuisance. Tryon, however, like Galton and Binet 
before him, was interested in individual differences for their own sake. He 
thought that science should concern itself with the differences as well as 
the similarities in behavior. In particular, he wanted to know how far the 
differences could be explained by genetic inheritance and how far by the 
physical and social environment. Thus his research and writing span an 
area from behavioral genetics through abilities and personality character- 
istics to social psychology and social ecology. Three examples of cluster 
analysis are repeatedly used in this book. The first is concerned with abili- 
ties; the second, with personality measures; and the third, with social areas. 
These represent the range of his interests fairly well, except that his genetic 
interests are not included. Tryon seems to have been the founder of the field 
we now call behavioral genetics, and this was the area in which he first 
became widely known. He carried out extensive selective breeding experi- 
ments in rats. The differences in rats' skills in finding their way through 
mazes to food became the central phenomenon for analysis. After about 
three generations, bright and dull strains of maze learners had been established.
Although Tryon did not return to laboratory experiments after his
departure for war service, he continued to retain a lively interest in behav- 
ioral genetics throughout his life, and students of his have since attained 
prominence in the field. 

An interest in individual differences is not, of course, unique to the 
twentieth century. Classification is usually one of the earliest forms of 
scientific activity, and it depends upon differences in individuals. As every- 
one knows, Aristotle delighted in classificatory systems. So did Linnaeus, 
who, without the aid of any measurement, devised a biological taxonomy in 
which membership of one class rather than another usually depended 
upon a single key attribute. 

Perhaps the main change in classification in the last 100 years has 
been a new emphasis upon measurement. The post-Darwinians began a 
search for a reliable method of quantifying association in measures of 
biological variability. Galton's discovery of the correlation coefficient in 1883 
brought this quest to fruition. Tables of correlations began to appear in 
scientific journals by the turn of the century. The way was then clear for the 
multivariate methods, such as factor analysis and cluster analysis, which
classify by the analysis of correlations.

When Tryon finished his doctorate in 1928, the most generally accepted
view of human ability was Spearman's two-factor theory. Spearman devised
an early form of factor analysis (the analysis of tetrad differences) in an 
attempt to demonstrate in human beings the existence of an all-encom- 
passing ability which he called intelligence. He theorized that this would be 
present in differing degree in a wide variety of mental tests but that in each 
test there would also be represented an ability which was specific to that 
test. Spearman's two-factor theory and his methods of analysis had come
under increasing criticism by the 1920s. Godfrey Thomson had shown that
any correlational proportionality which accorded with the two-factor theory
could also be explained in other than two-factor terms. Cyril Burt found how
to extract further factors than the general one and, by so doing, demonstrated
the need for a verbal factor as well as the factor of general intelligence
if the observed correlations were to be explained. Tryon arrived on
the professional scene just in time to add his criticisms of Spearman's
theory. He postulated many abilities of low generality, because abilities
were the product of multiple genes and diverse social environments, and 
he recognized the need for new methods to identify these lower-level abili- 
ties. Cluster analysis was his answer. 

In the early 1930s, Thurstone was the leading innovator in factor analy- 
sis. His method of multiple factor analysis was a much more satisfactory 
procedure for identifying abilities than Spearman's method had been.
Tryon spent a sabbatical year at the University of Chicago in the later 1930s.
He grasped the point, as many others at Chicago must have done, that similar
tests would have high correlations between them and that clusters of
related tests could therefore be identified without the labor of a centroid 
factor analysis by direct search of the correlations. Thus cluster analysis, as 
originally conceived by Tryon, was a poor man's factor analysis. Practicality 
was always a guiding motive for him. In his animal studies, he had found a 
way of getting automatic maze readings to lighten the experimenter's load. 
He now wanted to save the immense labor that Thurstone and his associates 
were devoting to factorial studies. Tryon also wanted methods which relied 
as much as possible upon logic and as little as possible upon mathematics. 
This preference continues to be displayed in this book. Actually cluster 
analysis has many equations, but this book is written with rather few. The 
logic and the practice occupy the center of the stage.

When Tryon worked mathematically, his usual choice was for geome- 
try. He seems to have seen clusters in his space rather as astronomers see 
galaxies in theirs. There has been some discussion in the psychological
literature about who first thought of the multiple-group method of factor 
analysis, in which axes were put through the centers of clusters. Guttman, 
Holzinger, and Thurstone all seem to have discovered variants of the 
method at about the same time. Since Tryon had been at the University of 
Chicago shortly before then, my guess—and it is only a guess—is that 
Tryon's cluster approach was one stimulant to the multiple-group method. 
When computer programs later came to be written for cluster analysis, 
Tryon required the output to include configurations as well as numbers. 
Many illustrations will be found in this book. Inflatable balloons became a 
standard part of Tryon's baggage when he went to professional meetings 
in his later years. 

Whenever psychology and mathematics came into conflict, Tryon 
almost always preferred a realistic psychological solution to a neat mathe- 
matical one. One example is his insistence (in agreement with Thurstone) 
that there is no compelling psychological reason why factors or clusters 
should be at right angles to one another. If they are oblique, then we must 
use oblique axes despite the mathematical unpalatability of doing so.
Another example appears in Tryon's method for getting a score for each 
individual on each cluster. He was not patient with the elaborate regressional
techniques by which even the variables outside the cluster played
their part in the complicated weighting scheme and through which the
scores on the different clusters were uncorrelated (or very nearly so). As
Tryon saw it, the price of such mathematical elegance was one's not 
knowing exactly what one had measured. For the sake of psychological 
understanding, he decided to obtain the cluster scores by straightforward
addition, even though they were then correlated with one another. 

After Tryon's service in World War II, he was chairman of his department
for several years, so that one might have wondered in 1950 whether he
would again contribute appreciably to multivariate methods. At the most,
so it might have seemed, there would be need only for a revised version of
the 1939 monograph. Then a computer became available to Tryon, and he 
entered upon another period of sustained effort in cluster analysis which 
ended only with his death. Tryon found that a computer made it possible 
for cluster analyses to be carried out as the logic demanded, without too 
many concessions to computational convenience. He realized, perhaps 
more firmly than any other psychologist of his time, that his quantitative
methods must be translated into a package of computer programs. Psy- 
chologists are indebted to the National Institutes of Health for their gener- 
ous financial support for Tryon's ambitious programming effort. This 
enabled him to collect a group of associates (often his former students) 
who knew cluster analysis very well, were able to express its ideas in convenient
computer terms, and then to write the individual programs or to
supervise their writing. The package, under the strange name of BC TRY 
(Bailey explains the origin of this name in the Preface), forms a model of its 
kind, and it is now widely used. In designing the package, Tryon was eclectic.
He provided various options so that each investigator could use his own
judgment in selecting whatever method and computer output he needed.
Hence BC TRY is essentially a package in factor analysis as well as cluster
analysis. Indeed, the distinction between the two has become somewhat
blurred. Tryon's eclecticism reflects his tolerance, which was one of his
strongest personality characteristics.

Tryon and Bailey assume, correctly enough, that any reader of this
book will have access to a computer. The account will be of only limited value
to anyone who continues to use desk calculators. The text bristles with the 
new computer words—DAP, SPAN, OMARK and dozens of others—to which 
quantitative behavioral scientists must now become accustomed. Chapter 
3 gives an overview of the computer system, and Chap. 12 describes its 
statistical and logical basis. (Incidentally, equations are more abundant in
this later chapter.) Details of the computer control cards are given in Chap. 
13, where a User's Manual supplements the book. 

Years of labor went into BC TRY. We are reminded how much effort 
is needed to get our computer programs into good order. The task was 
periodically delayed by changes in the computer hardware. Bailey points 
out in the Preface that the cost to researchers of the frequent changes of 
computers made by almost all universities has been very great. I agree
strongly with him. I do not think that computer administrators have always
paid sufficient regard to the need for continuity in users' research. Programs
have had to be rewritten too many times. It has become clear that computers
should have been changed less often and that bigger leaps should have 
been taken in computer power whenever a change became inevitable. 
Furthermore, the outgoing machine should almost always have been left 
to work for longer periods of time alongside the incoming one which was 
scheduled to replace it. It is pleasing to know that Bailey has grappled firmly
with this problem of programming continuity. He continues to update pro- 
grams and the Users’ Manual as computers change. Furthermore, he has 
planned financial arrangements to enable BC TRY to be modified to fit the 
computers of various manufacturers. Many of us have complained about 
the wastefulness of the present haphazard system by which each university 
duplicates the efforts of the others in building its independent library of 
basic computer programs. Interchange of programs has been erratic, quality
of those being exchanged has often been unsatisfactory, and far too
much programming has been duplicated. A national library of behavioral
science computer programs is clearly needed. Bailey has pioneered the way 
to this by his system for continuous maintenance and national distribution 
of BC TRY. 

The BC TRY package does more than give the user many options. Tryon and
Bailey repeatedly urge the cluster analyst, after studying his initial results, 
to redefine his conditions and then make a further analysis. In desk calcula- 
tor days, no one ever wanted to repeat his analysis. With a computer, it has 
become easy for any researcher to do his work a second time, profiting from 
what he has learned in the initial analysis. Critics sometimes allege (with 
some justification) that the computer will dehumanize behavioral research 
by eliminating the opportunity for judgment in the statistical analysis. Tryon
and Bailey demonstrate unequivocally that this need not be so. 

I should like to conclude this foreword by paying public tribute to
Robert Choate Tryon as a man. I found him almost invariably courteous,
kind, humorous, and humane. He set for himself the ideals of a gentleman
and a scholar, and he was wise as a counselor and of loyal and generous
spirit as a friend. I believe these were the generally accepted impressions of
the man.


Charles Wrigley 


Preface 


This book is designed to present an integrated theoretical and applied
description of cluster analysis as conceived of by the late Robert C. 
Tryon and his colleagues over the span of more than 30 years. The material 
presented is the current state of the science, its methods, and research 
findings in a variety of application fields. The reader is given enough infor- 
mation to be able to comprehend and use the most sophisticated and elab- 
orate cluster-analysis procedures and to understand the complex results of 
cluster-analysis applications. Consequently, the book can be used as a text 
in both undergraduate and more advanced courses. The book is also 
intended as a guide for the more seasoned researcher just beginning to 
feel the need for cluster analysis in his own research. To the sophisticated 
scientist the book offers many new ideas in cluster analysis methodology 
and research findings. 

The range of topics discussed extends over the most basic concepts 
of observation of natural phenomena, the logical foundations for the cluster 
analysis of variables, the mathematical formulation of cluster analysis, the 
logical and operational bases of cluster analysis of objects, and the technology
of cluster analysis computer systems. The intent is as much substantive
as methodological. The substantive content of the book is drawn
from the behavioral and social sciences. However, in the last few years, the
science of cluster analysis has been discovered to be a valuable tool in the 
physical, economic, and biological sciences. We have witnessed the growth 
of the use of the methods of cluster analysis presented in this book to 
include such diverse scientific disciplines as chemistry, genetics, hydrology, 
zoology, glaciology, medicine, and soil morphology, in addition to the more 
expected disciplines such as psychology, education, sociology, anthropology,
and economics.

The finished product presented here was conceived by Robert C. 
Tryon in the middle 1930s and pushed by him to its present form through 
many stages of sophistication and technology. In 1939, Tryon published a 
monograph, “Cluster Analysis,” containing the broad outlines of this book.
The methodology and theoretical principles of cluster analysis stated in that
monograph were expressed in “oxcart” desk calculator “programs” that
taxed the patience and skill of the most ardent cluster analyst. That period
culminated in the first laboriously worked-out example in the form of a
monograph, “Identification of Social Areas by Cluster Analysis,” by Tryon
in 1955. Almost all the methodological principles in the current form of cluster 
analysis were anticipated or already had been achieved in that publication. 
The advent of the computer (an IBM 701 at the University of California in the 
fall of 1956) and the presence of Charles Wrigley at Berkeley at the same time
stimulated a period of great activity and inventiveness in cluster analysis on 
the part of Tryon that ended only with his death in September, 1967. Tryon 
was able to capitalize on the stimulation from others working in allied areas 
at that time in Berkeley, principally Charles Wrigley, Henry Kaiser, Louis 
Guttman, and John Neuhaus. Out of these associations Tryon found tre- 
mendous inspirations and sources of motivation that led eventually to the 
modern concepts of cluster analysis presented in this book. I had the good
fortune to be a graduate student at the initiation of this productive, exciting
period and wrote, with the guidance and assistance of John Neuhaus, the 
first computer program to execute the central analysis of cluster analysis, a 
key-cluster analysis of variables from a correlation matrix (Tryon, 1958a).

In 1959 we conceived of a general computer system to execute all the 
main forms of cluster and factor analysis as mere special cases or options 
within the general system. This effort was encouraged by Dr. Philip Sapir, 
of the National Institutes of Mental Health, and funds were awarded for 
Project CAP, Cluster Analysis Programs, in 1960 by the National Institutes of 
Mental Health to Professor Tryon. During subsequent years the staff of the 
project met regularly, exchanging ideas and developing the theoretical, 
statistical, and computer aspects of the cluster analysis system. Early in the 
development it became necessary to attach a name to the system of computer
programs. I suggested the name TRYON in recognition of the source
of the entire development of cluster analysis. Professor Tryon, in modesty,
demurred. However, the other staff members and I were able to convince
him that the name TRY would be appropriate because of the experimental 
nature of much of the work we were doing; we were indeed going to try many 
ideas. Because of the computing center participation in the IBM consortium
SHARE, the letters BC (standing for the Berkeley chapter of SHARE) were
tacked onto the front of the name, and the somewhat appalling designation
of the “BC TRY System” has become so widespread that it probably cannot
be changed. 

In the intervening 10 years we have seen many changes in computer 
technology, computers, and computer languages. All these changes have 
been painful to a certain extent, and we are not completely convinced that 
the overall result has been beneficial. We went from the IBM 701 to the IBM
704, to the IBM 7090, the IBM 7094, to the CDC 6400 and at Colorado from
the 7090 to the IBM 709. In the process, changes in computer operating
systems and programming languages have been more frequent than
changes in computer. As a consequence, the productive work is somewhat 
less than half of what would have been accomplished under stable com- 
puter conditions. Perhaps that is progress, but it reminds one of the 
pioneers who struck out due west only to have to keep changing their
route because of impassable mountain ranges and deserts.

Many hands and minds have built the BC TRY System over the years
since 1959. We are particularly indebted to the programmers and the programming
supervisors (in which category I worked on project CAP for two
years). Also critical in the development of the System and the production 
of the research leading to this book are the research assistants and secre- 
tarial staff of project CAP and its successors. It is impossible here to do jus- 
tice to the contributions. John Vinsonhaler played a key role in supervision 
of the programming on the 704 in its final stages and in the 7090 program- 
ming. Robert Russell was the main innovator of many special programming 
features of the System and wrote many of its programs, especially on the 
704 and the 7090. Many others made signal contributions to the programming,
especially John Vinsonhaler, David Маша, and John Bauer (who is a
collaborator with the authors in Chap. 13), Tom Kibler, and Eleanor Krasnow. 
In formulating the programming of BC TRY those most centrally involved 
were Professor Tryon, myself, John Vinsonhaler, Robert Russell, William 
Meredith, and John Bauer. Chen-lin Chu and Jin-Yu Yen (also a collaborator 
in Chap. 13) assisted Professor Tryon in performing many methodological 
studies that led to decisions incorporated in the programs. Others who have 
made important contributions in development are Richard Burack, James
Cameron, Mike Davidson, Don Flory, Robert Menhenett, Kent Mitchell, and 
John Wolfe. On the secretarial side we were valuably served by Valerie 
Siebert, Louis Ruhland, Sara Bailey, and Nancy Geidel. 

Dr. Harley B. Messinger read the completed manuscript and offered 
many valuable suggestions that led to improvements in the finished product. 
His assistance is deeply appreciated. 

I wish to thank the Editor of the journal Multivariate Behavioral
Research for permission to reproduce copious material from the papers
Professor Tryon and I published in the journal. Many of the chapters of this
book are reworked versions of those papers. The Abridged User's Manual 
of the BC TRY System appears here as Chap. 13 by courtesy of Tryon- 
Bailey Associates, Inc. 

The support of the National Institutes of Mental Health in grants 
MH 0811 and MH 08314 is gratefully acknowledged. 

Our wives, Freida Tryon and Sara Bailey, deserve a special expression 
of gratitude. Their encouragement and interest played an important role 
in this book. Mrs. Tryon's active interest and participation was an important 
contribution to Professor Tryon's work. 

It is fitting that the close of this preface and the dedication of this book
be the statement appearing in the October, 1967, edition of Multivariate
Behavioral Research: 


Robert C. Tryon, great scientist, true friend, died on the 27th day of September,
1967. The lives of his colleagues and friends will never again be quite so
full. The field of multivariate experimental psychology will never again be
quite so rich.


Daniel E. Bailey 
CLUSTER ANALYSIS


Chapter 1 


INTRODUCTION 


Understanding our world requires conceptualizing the similarities and
differences between the entities that compose it. By entities we
mean either the objects or the properties of objects that constitute our
world. In a study of the mental abilities of children, for example, the objects
are the children, and their properties are the abilities in which they vary.
Objects may be persons or animals, or subgroups of them, or any identifiable
physical or psychological structures. Their properties are the characteristics
with respect to which the objects vary from each other and on the
basis of which we differentiate them from each other; thus, properties are
generally known as “variables” or “attributes” or in biology as “characters.”
Variables may be measurements on a continuum, or they may be
discontinuous qualities.

Cluster analysis is the general logic, formulated as a procedure, by
which we objectively group together entities on the basis of their similarities
and differences. When the entities compared are variables, such as
mental abilities, the procedure is called “the cluster analysis of variables”
or, more simply, “V-analysis.” V-analysis has its historical roots in the initial
work of two Englishmen, Karl Pearson (1901) and Charles Spearman (1904),¹
and the later developments of Godfrey Thomson (1916, 1951) and Cyril Burt
(1915, 1941). Around the 1930s V-analysis took a special form of mathematical
dimensional analysis as the result of the work of Truman Kelley (1928),
Karl Holzinger (1930, 1941), and Leon Thurstone (1931, 1947); Thurstone
labeled the new approach “factor analysis,” a term still used (Harman,
1967) though strictly speaking it refers only to those mathematical procedures
of V-analysis called “factoring” (see Chap. 6).

¹ References in text are found alphabetized and dated in reference section at end of
book.

The “factors” derived from variables by the process of factoring are 
often interpreted as “underlying” the observed variables—as if they repre- 
sent genetic or psychological dispositions of persons. In the 1930s Tryon 
opposed this trend to reify factors (Tryon, 1932, 1935) and devised the 
procedures now called “cluster analysis” (Tryon, 1939); this term was
chosen to stress the fact that one can discover the general properties of 
objects by an objective clustering procedure of grouping variables without 
imputing causative underlying dynamics to the properties. 

The term cluster analysis applies equally well to entities that are 
objects, such as persons. The procedure of grouping together objects that 
have similar patterns of characteristics is called the “cluster analysis of
objects” or simply, “O-analysis.” This is the field of typology, much older
than V-analysis, having its scientific roots in the taxonomy of Linnaeus, in 
the diagnostic syndromes of physicians, and in genotype analysis of geneticists.
O-analysis is new as a quantitative field in biology, where it is known
as “numerical taxonomy” (Sokal and Sneath, 1963). By factor analysts it is 
called “inverse factor analysis” or “Q-technique” (Stephenson, 1935; Burt,
1937; Cattell, 1952, chap. 7). Factor analysts treat O-analysis merely as the 
application of dimensional factoring to objects treated as variables. So 
conceived, it plays a minor role in factor analysis and is even ignored in 
some systematic treatments of the field (e.g., Harman, 1967).

This book presents O-analysis as a highly developed form of cluster 
analysis in which the dimensional factoring approach is handled as only 
one method and, indeed, one that does not resemble orthodox Q-technique 
factor analysis. We can anticipate rapid development of O-analysis, as 
indicated by the cross-disciplinary Conference on Cluster Analysis held 
in New Orleans in 1966. 

From the above discussion it is clear that the term cluster analysis 
embraces both V-analysis and O-analysis. The specialized term factor analysis
is retained in cluster analysis, where it is properly allocated to its important,
though subordinate, role as a dimensional procedure. Cluster analysis
should not be called factor analysis, because factoring is only one of its 
subordinate methods. It should not be thought of as referring only to the 
clustering of objects because it embraces both V-analysis and O-analysis. 
Since assessment of similarities and differences among entities is a
universal conceptual problem, we can anticipate a rapidly expanding use 
of the procedures of cluster analysis in nearly all fields of human thought 
and scientific study. The real possibilities of its objective methods could not 
be capitalized on until the advent of modern digital computers because 
the observable properties of entities can be very numerous and the objects 
having such properties usually also occur in vast numbers; hence only the 
modern computer can manage V-analysis or O-analysis. Throughout this 
book, therefore, all analyses are executed by a computer. The computer 
programs employed are those of the BC TRY System of cluster and factor 
analysis, especially designed to handle all facets of V-analysis and O-analysis. 
When the first modern computer (IBM 701) was introduced at the University
of California in 1956, we immediately undertook the programming of the 
BC TRY System, and it has taken a full 10 years to think out and program 
the complete logic of the system, the developmental process being much 
delayed by changing programming languages and machines. Much of this 
book is a description of this computer system and its application. 

The intent of this book is as much substantive as methodological. 
Complete cluster-analytic procedures have been applied to three separate 
problems. The first is in the cognitive domain of the intellectual abilities, 
being the famous Holzinger study of the scores of 301 schoolchildren on 
24 diverse tests of verbal, spatial, speed, memory, and mathematical abili- 
ties. The second is in the well-known personality domain of self-conception, 
being the responses to the 566 items of the MMPI (Minnesota Multiphasic 
Personality Inventory) of 310 patients and normal adult subjects. The third 
is in the field of the ecology of metropolitan social areas, in which the objects
are groups instead of individuals; they are the several hundred neighbor- 
hoods of the San Francisco Bay Area, observed both on demographic and 
voting-attitude characteristics before and after World War II.

In order to give a suggestion of the full scope of a cluster analysis of 
the data of a given problem, in this chapter we first review the main analytic 
steps and findings of the Holzinger problem and then do the same, except 
more succinctly, for the MMPI and the social-area problems. The chapters 
of this book generally follow the summary analyses presented in this
introduction.


The Holzinger study of intellectual abilities 


Cluster analysis of the variables (V-analysis) 


When one observes 24 different intellectual abilities of many children, is it 
necessary to preserve all 24, or can a reduced number of composites of
them fully account for all that is general among the 24 abilities? This is the
question posed by V-analysis, and it is answered by applying the inherent 
cognitive processes by means of which we generally organize entities in 
terms of their similarities and differences. 

The degree of generality of individual differences in each of the 24 
abilities is revealed by the degree to which individual differences in it corre- 
spond to, i.e., correlate with, differences in the other 23 abilities. By the 
process of comparing each ability with every other ability, with respect to 
the similarity in which they order individual differences, a detailed state- 
ment of the relationships of each ability with the others is summarized in 
a "correlation matrix." This is taken up, with other basic measurement 
problems, in Chap. 2. 
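
The comparison of every variable with every other can be sketched in a few lines. The following is a minimal illustration, not the BC TRY implementation: the function names and the toy scores are hypothetical, and the Pearson product-moment coefficient is assumed as the index of correspondence.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(scores):
    """scores maps a variable name to its list of scores, one per object."""
    names = list(scores)
    return {a: {b: pearson_r(scores[a], scores[b]) for b in names} for a in names}

# Hypothetical toy data: three tests, five children (not the Holzinger scores).
toy = {
    "sentence_completion": [17, 22, 14, 25, 19],
    "word_meaning":        [15, 24, 12, 27, 20],
    "number_series":       [9, 11, 10, 8, 12],
}
R = correlation_matrix(toy)
for name in toy:
    print(name, [round(R[name][other], 2) for other in toy])
```

Each row of the printed matrix is the pattern of correlations of one variable with all the others, which is the pattern that V-analysis compares across variables.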

One type of V-analysis is to form "rational composites" out of the 24
variables, grouping them by inspection into a priori content categories, 
such as verbal, speed, spatial, memory, and mathematical composites. 
But this logical grouping procedure takes no account of the correlations 
among the 24 abilities. Instead, we prefer to employ the process of empiri- 
cal grouping of the 24 variables, casting together into the same group those 
which correlate positively with each other, and especially those whose pat- 
terns of correlations with the other abilities are similar. Variables that form 
such like-patterned groups are called “collinear” clusters. They are the 
composites we seek in V-analysis, as is fully explained in Chap. 4. The con- 
trast between the rational- and the empirical-grouping approach is illus- 
trated in the Holzinger problem, where we find that though there appear 
to be five logical groupings of the 24 variables based on obvious content 
similarity, there are in fact only four collinear clusters among the 24 tests. 
It appears that the five tests of mathematical abilities are heterogeneous 
and redundant. 

By what process do we discover that we can properly reduce the 24 
abilities to the specific number of only four empirical clusters? We do so by 
the process of concentrating information, for which the procedures of 
"dimensional analysis" or "factoring" have been devised. These processes
enable us to determine the number of dimensions, or clusters, that are 
sufficient to account for all the correlations among the 24 tests. The pro- 
cedure begins by deciding ahead of time the amount of variance among 
individuals that we want to account for by the reduced number of composites.
For each test this variance is called its "communality," an index
of such critical importance that we devote Chap. 5 to it. The factoring 
procedure itself, developed in Chap. 6, reveals that only four salient
cluster-defined dimensions are sufficient to account for all the communalities of
all 24 variables. 

We must finally choose the defining variable of each of the four composites.
We do so by the process of examining the total configuration of
the relationships among the 24 variables. There are three main selection 
criteria: each cluster of variables should be (1) as "tight," i.e., collinear, as 
possible, (2) as nearly independent of the others as possible, and (3) 
able to account for as much general variability as possible. In Chap. 7, 
which considers this problem, it will be seen that there are simple com- 
puterized aids, both graphical and metric, that enable the analyst to 
observe the cluster structure among the 24 tests and to make the best
selection of four composites. The composites turn out to be subsets of
variables that define the four basic abilities of V (verbal), S (speed), F (form
or space), and M (memory).


Cluster analysis of objects (O-analysis or typology)


The purpose of O-analysis in the Holzinger problem is to differentiate the
301 children of the study into a set of salient ability types. Each type consists
of children who have the same profile of scores on the four basic
composite abilities found in the V-analysis, namely, on V, F, S, and M. In 
solving this problem we use the same four logical processes employed in 
discovering the clusters of variables in V-analysis, i.e., the processes of 
comparing, grouping, concentrating, and inspecting structure. 

By the process of comparing each child with every other child in their 
profiles on V, S, F, and M, we develop a similarity matrix among the children 
(analogous to the correlation matrix among variables in V-analysis). To 
secure this matrix we direct the computer to represent each child as a 
point in the four-dimensional score space of V, S, F, and M. In this space, 
the degree of similarity of any two children is defined by the distance 
between their two points, an index called the "euclidean distance," D.
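
A minimal sketch of this similarity index follows. The profile values are hypothetical, and they are assumed to be expressed on a common scale for all four composites; the helper names are illustrative and are not those of the BC TRY programs.

```python
from math import sqrt

def euclidean_distance(p, q):
    """Distance between two profiles given as equal-length tuples of scores."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(profiles):
    """profiles maps an object id to its score profile on (V, S, F, M)."""
    ids = list(profiles)
    return {i: {j: euclidean_distance(profiles[i], profiles[j]) for j in ids}
            for i in ids}

# Hypothetical profiles on (V, S, F, M); not taken from the Holzinger data.
children = {
    "child_a": (60, 45, 50, 55),
    "child_b": (62, 47, 49, 54),
    "child_c": (35, 60, 65, 40),
}
D = distance_matrix(children)
print(round(D["child_a"]["child_b"], 2), round(D["child_a"]["child_c"], 2))
```

A small distance (child_a and child_b here) marks two children with nearly the same profile; a large one (child_a and child_c) marks dissimilar profiles.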

The process of grouping is implemented by directing the computer 
to search in the score space and find those concentrations of points which 
define distinctive types of children. 

The process of concentrating this information consists of directing 
the computer to find, by an iterative process, a minimally sufficient set of 
types that represents the whole configuration of children and to set aside 
any child whose score pattern is too unique to be included in any type. In 
the Holzinger data, our computer program locates 15 ability types, into 
which all but 16 unique children are cast. 
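
The idea of iterating toward a set of types while setting aside unique cases can be illustrated as follows. This is only a sketch under stated assumptions: it uses a simple nearest-center reassignment loop with a fixed distance cutoff, which is not claimed to be the procedure of the BC TRY programs, and the starting centers, cutoff, and profiles are all hypothetical.

```python
from math import dist  # Python 3.8+

def assign_types(profiles, centers, cutoff):
    """Label each profile with its nearest center, or None if beyond the cutoff."""
    labels = []
    for p in profiles:
        d, k = min((dist(p, c), k) for k, c in enumerate(centers))
        labels.append(k if d <= cutoff else None)
    return labels

def update_centers(profiles, labels, centers):
    """Move each center to the mean profile of its current members."""
    new = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(profiles, labels) if lab == k]
        if members:
            new.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new.append(old)  # keep an empty type where it was
    return new

def iterate_types(profiles, centers, cutoff, max_iter=50):
    """Alternate assignment and center updates until the labels stop changing."""
    labels = assign_types(profiles, centers, cutoff)
    for _ in range(max_iter):
        centers = update_centers(profiles, labels, centers)
        new_labels = assign_types(profiles, centers, cutoff)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical profiles on (V, S, F, M); the last one resembles no other child.
profiles = [(60, 45, 50, 55), (62, 47, 49, 54),
            (35, 60, 65, 40), (37, 58, 63, 42),
            (80, 80, 20, 20)]
labels, centers = iterate_types(profiles, [profiles[0], profiles[2]], cutoff=10)
print(labels)
```

In this toy run the first four profiles fall into two types and the fifth receives the label None, the analogue of a child whose pattern is too unique to be typed.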

The final process of observing the whole structured configuration of 
children in relation to each other is most critical. The precise description 
of the whole configuration is the total matrix of D values between all the
children, but since it is too complex to comprehend as such, various other
metric and graphical aids (fully developed and illustrated in Chap. 8) are 
employed to help decide on the final set of interrelated types to be used 
in classifying the children. Among these aids is a hierarchical ordering of 
the types, such as the genus-species classification system of biology. 


Comparative cluster analysis of variables 
(VCOMP) and of objects (OCOMP) 


Just as in biology, where it is well known that different organic species and 
varieties live in different ecological settings and are thus selected for differ- 
ent characteristics, so ecologically different groups of human beings are 
likely to differ in their patterns of abilities. This point is demonstrated 
by a comparison of the cluster structure of the abilities of two different 
groups of children, one from a suburban school and the other from a 
school in a factory area. In Chap. 9, comparative procedures, called
"COMP analysis," are used to make indirect and direct comparisons of the
cluster structure of the 24 abilities in the two school groups. They turn out, 
in fact, to be quite similar, except that the test clusters of the factory 
children are more independent of each other than those of the suburban 
children, a fact with both social and genetic implications. 

Though the structures of the abilities of factory and suburban chil- 
dren are fairly similar, it does not follow that when the typological struc- 
tures of these groups are compared, they, too, will be similar. OCOMP 
procedures, also presented in Chap. 9, are designed to find out. When we 
compare the two groups of children on a common array of ability types, the 
frequency of cases in the different types varies significantly; furthermore, 
when we perform independent typologies on the two groups and project 
the two arrays of types into the same analysis, we find some marked 
differences in the typological structures of the two groups. Such findings as 
these have obvious educational, occupational, and social implications, 
hitherto ignored for want of methods capable of revealing such salient 
differences. As a further illustration of comparative analysis of two typologies,
a comparison between the ability typologies of boys and girls, long a
matter of speculation in the psychology of sex differences, is also given
in Chap. 9.


Prediction from clusters of variables 
and from object clusters


Suppose one wished to predict a child's mathematical ability from knowledge
of his scores on the verbal, speed, form, and memory tests. One way
is univariate prediction, in which the predictor variable is a single score,
either from a single test or from a composite of tests. In Chap. 10, on pre- 
diction in cluster analysis, we will see that individual tests are generally 
poorer predictors than the cluster composites discovered in V-analysis. It 
is also found that multivariate prediction, in which the predictors are all 
four cluster composites, V, S, F, and M, weighted by the multiple correla- 
tion methods, is generally better than univariate. But the superior form is 
differential typological prediction. This is the prediction of mathematical 
ability from knowledge of the distinctive "O-type profile pattern" of a
child's scores on the four predictor cluster composites. For children hav- 
ing one distinctive type of pattern of scores on V, S, F, and M, we can pre- 
dict mathematical ability with considerable accuracy, whereas for those 
of another distinctive pattern type, our prediction is no better than chance. 
Thanks to the computer, we can also now realistically assess the role of 
sampling error in such predictions by means of Monte Carlo runs, from 
which we can determine with accuracy which O-type predictions are sig- 
nificant and do so without making the usual normal-curve assumptions 
that have made such estimations of error rather questionable in the past. 
One special merit of differential O-type prediction is that we can now 
cumulate knowledge about a given O-type from a discovery of those "outside"
attributes which are predictable from it. This form of prediction is a
new approach in science. It means that each O-type is likely to have a highly
predictable but distinctive pattern of characteristics differing from that
of other O-types.
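
How Monte Carlo runs can assess sampling error without normal-curve assumptions may be illustrated with a random-shuffling test. This is a sketch of one common Monte Carlo device, assumed here for illustration only; the text does not say that the authors' programs work this way, and the function names, number of runs, and data are hypothetical.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def monte_carlo_p_value(x, y, runs=2000, seed=0):
    """Share of random re-pairings of y with x whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (runs + 1)

# Hypothetical predictor and criterion scores for the members of one O-type.
predictor = list(range(10))
criterion = [2 * v + 1 for v in predictor]
p = monte_carlo_p_value(predictor, criterion)
print(p)
```

A small value of p says that chance pairings of the scores rarely predict as well as the observed pairing, so the prediction for that type would be judged significant.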


The personality domain of self-conception: the MMPI 


The second substantive problem treated throughout this book deals with
the personality domain of reporting on emotional and social reactions in
oneself that might be maladaptive. The data set consists of the responses
to the 566 items of the MMPI (Minnesota Multiphasic Personality Inventory)
of 90 normal subjects and 220 psychiatric patients.


V-analysis and O-analysis 


This MMPI study nicely illustrates the perennial problem of reducing by 
V-analysis the large number of item-variables that comprise some tests, 
questionnaires, or other surveys. The BC TRY System includes programs 
that can perform such an analysis on as many as 2,000 variables. The 
special procedure for such a large-scale solution is called BIGNV analysis,
the logic of which is presented in Chap. 11 and illustrated there on the
MMPI problem. In principle, there is no limit to the number of variables 
that can be tackled by BIGNV analysis. In a nutshell, here are the steps. 
The computer calculates the communalities of all item-variables, orders 
them by magnitude, and casts out all items of trivial generality, since they 
can play no general role in differentiating persons from each other. Thus, 
about two-thirds of the 566 MMPI item pool go out at the very beginning as
sheer dross. The cluster structure of the remaining pool of good items is 
then computed in manageable subsamples and then merged to give, 
finally, the salient cluster structure of the total supply of all item-variables. 
The logic of this approach is as old as statistical theory: cumulate the 
results on successive subsamples until they converge. The converged 
results describe the cluster structure in the full pool. 

The special power of V-analysis to reveal the basic domains sampled 
by such a large pool of items as that of the MMPI inventory is clearly evident
in this study. The empirical cluster search procedure finds seven
item-clusters. The four most salient V-clusters are the disturbing conditions
of I (introversion), B (body malfunctions), S (suspicion and mistrust),
and T (tension, worry, and fears). The four scores of subjects on the
I, B, S, T item-clusters largely measure the common variance of the whole
MMPI item pool, though three additional clusters, D (depression), R (resentment),
and A (autism) are also clearly evident; however, the scores on these
three are largely redundant, i.e., predictable from scores on I, B, S, and T.
Each of these seven item-clusters forms an objective scale, each with
reliability coefficients approaching .90.

When the inclusive group of 310 subjects is scored on the basic four
I, B, S, T item-clusters and the O-analysis procedures are applied to them,
a clear, meaningful typology of 14 types of person appears (Chap. 8),
with no unique persons. 


Comparative analysis 


Perhaps the most dramatic discovery in the MMPI problem, and one that 
has a potent moral for cluster and factor analysis generally, is the finding 
(Chap. 9) from comparative V-analysis and O-analysis of the two subgroups 
of normals and patients that make up the subject sample. The results on 
these two groups are radically different from each other. This finding illus- 
trates how a factor analysis of an inclusive group may differ from factor 
analyses performed on the subgroups that compose it. This finding further 
proves that when the MMPI is cluster analyzed by the methods described 
in this book, this questionnaire turns out to have very potent validity in 
differentiating normal subjects from psychiatric patients. 


Prediction 


The power of prediction from V-clusters and O-clusters is sharply clear in 
this MMPI analysis (Chap. 10). When the subjects' scores on the three 
item-clusters D (depression), R (resentment), and A (autism) are predicted 
from their scores on the basic four, I, B, S, and T, we find, just as in the
Holzinger study, that as one moves from univariate, to multivariate, to 
differential O-type prediction, the ability to predict sharply increases. 
Indeed, for some subjects having a distinctive profile on I, B, S, and T, we 
find their scores on D, R, and A are almost perfectly predictable. For
others, of course, the prediction is poor.


The social-area study 


The third substantive study analyzed in this book moves to a different
type of object, namely, groups rather than individuals. In our own case,
the groups are neighborhoods (census tracts) making up the metropolitan
San Francisco Bay Area in prewar 1940 and in the postwar period circa
1950. No attempt is made in this introduction to review the findings of this
social-area study in any detail. Suffice it to say here that in the following
chapters the same types of analyses are made on the neighborhood data
as on the individuals in the Holzinger and MMPI problems; but here are
some highlights. The focal variables that describe the properties of these
neighborhoods are 34 demographic characteristics of the neighborhoods
published by the Census Bureau in the prewar and postwar decades of
1940 and 1950. The findings of the V-analysis show that only three basic
domains are sampled by the 34 characteristics, namely, F (family life),
A (assimilation), and S (socioeconomic). Thus, if one knows the F, A, S
scores of a neighborhood, one knows all there is to know generally about
its demographic character (as reported in the census).

In addition to the demographic features of the neighborhoods, data
on the voting of the neighborhoods are also carried along in the analyses.
Perhaps the most surprising discoveries are the findings (1) that the
demographic and voting attitudes of neighborhoods have an identical
tridimensional cluster structure, (2) that this structure remains constant
over a decade despite the social upheavals of a great war, and (3) that, in
typological prediction, if one knows the demographic pattern of a neighborhood,
one can predict how it will vote on some issues before the election
takes place, in some cases 15 years before the vote is taken, and in spite
of the fact that the people voting are largely different people from those
on whom the demographic data were collected 15 years before!


Chapter 2 


GENERALITY OF 
INDIVIDUAL DIFFERENCES 


Since the purpose of cluster and factor analysis is to discover the general
properties of objects (V-analysis) and the general types into
which the objects can be classed (O-analysis), we should agree at the start
on what "generality" means. A general property, variable, or attribute of
objects, e.g., an intellectual ability of children, is defined here as a composite
of two or more variables that similarly order the subjects observed
with respect to that property. The word "similarly" appears consistently
here in the definitions of generality; hence we need a clear understanding 
of what we mean objectively by similarity. 

This chapter is a preliminary exposition of how the similarities of 
properties and of objects are described and measured. In describing the 
similarity of two variables, the focus is on the variation among objects in 
each of the two variables, i.e., on individual differences. To be clear on 
what is meant by similarity we must be clear on what is meant by individual 
differences among objects. We therefore present some basic concepts on 
the observation and description of individual differences in a single attri- 
bute, or property. First we discuss the observation of individuals and then 
the means by which individual differences in an attribute are described. 
Methods of describing the generality of individual differences are then
discussed. 

In this treatment we illustrate the concepts mainly on the intellectual 
abilities of the Holzinger study, in particular on tests of the so-called 
“verbal” and “mathematical” abilities. This particular problem is interest- 
ing since it bears on the important question of whether there are two such 
general abilities or factors and, if so, the degree to which they are similar. 
Only the briefest account of the science of psychological measurement of 
individual differences can be given here, and only those psychometric 
principles which have direct relevance to the problem of assessing similarity 
of individual differences in two or more characteristics will be introduced. 

With the psychometric concepts behind us, we can address the basic 
designs by which cluster analysis and factor analysis reveal general traits 
and types of individuals. First we deal with methods of discovering and 
describing the degree of similarity of variables and of objects. Finally we
discuss the graphical representation of similarity among variables and
objects. 


The observation of individual differences in an attribute 


When we use the word "observation," we generally are interested in the 
principles by which some conceptualized property of objects comes to be 
represented by scores. For example, in the Holzinger study, the 24 different 
abilities are represented by scores obtained by observing the behavior of a 
group of children, 145 children in the instance described (see Table 4.1). 
The process may be a simple tabulation of the number of “correct” 
responses made to a task, such as the sentence completion task representing
a specific verbal ability. It is instructive to look carefully but in an elementary
fashion at this problem.


Abstract and operational definitions 


The investigator first forms an abstract concept of what property or attri- 
bute of objects he wishes to observe. Thus, in the example of verbal 
ability, he decides that he wishes to observe a child's ability to complete, 
verbally, an idea that is only partially presented to him in the test situation. 
How this ability is expressed in a real situation requires a specification of 
the verbal conditions under which the abstractly defined ability is believed 
to emerge in a child in the observation situation. The investigator might 
think up some incomplete sentence that partially expresses an idea and 
give a child a choice of words to complete the sentence, only one of which
correctly does so. An example: 


At home she does not understand how to make 
1) difficulties, 2) quarrels, 3) arrangements, 4) labor. 


The format of the abstract and operational definition with regard to a 
“number” ability is much the same except that the incompleted idea is 
quantitative, as shown in this example: 


[Number-series item not legible in the source.]


The child completes the series by supplying the number, "out of
his head," that correctly continues the series. All 24 variables in the
Holzinger study are defined in a similar manner. 

Knowing the abstract and operational definitions of each variable 
in a study is necessary for an understanding of the meaning of general 
clusters and types derived by V-analysis and O-analysis. For example, 
imagine we discover empirically that five of the six memory tests form a 
cluster, i.e., that they tend to order the subjects in the same way. The 
meaning of a single composite score on the five tests becomes clear only 
by carefully studying the abstract and operational definitions of the items. 


Scoring (coding)


The operationally defined property of an object, like the responses of a
person to a stimulus item, usually is assigned a number according to some
principle like the degree of intensity or the "goodness" of the responses.
The number-coding of responses is usually based on a key, sometimes 
arbitrarily decided upon by the investigator. Even more arbitrariness is 
introduced when stimulus items are cast in “objective” test form, like the 
multiple-choice designs of the illustrative vocabulary and mathematics 
items cited above. In such cases, responses are a priori prejudged "right"
or “wrong” and are scored accordingly. 

With modern computer facilities this arbitrariness can be reduced. 
All the bit responses of a subject in any experimental test situation can be 
marked by the subject on scoring sheets, which can then go to an automatic 
photoreader that converts the responses to records to be input to a com- 
puter. Then, applying methods described in this book, those responses 
which are found empirically to cluster together can be composited together 
to form not an arbitrarily defined but an empirically discovered composite 
or variable. Thus, cluster and factor analysis procedures can be used to 
discover how to score the many seemingly chaotic responses of subjects.
The procedures designate what composite each item belongs to and the
manner in which it should be scored. 


Describing individual differences in specific and general attributes 


The first steps in observing each separate attribute are themselves forms 
of condensing, reducing, or compositing of many entities, either by group- 
ing them together by definition or by some empirical clustering procedure. 
The procedures of V-analysis continue this initial process by condensing 
or compositing the scores on the observed variables into even larger com- 
posites. Larger composites can also be made by grouping the 24 tests into 
rational categories or by performing a V-analysis of all 24 tests and form- 
ing composites on empirically derived clusters. 

The observed raw scores of the individuals on each ability must be 
described on (or transformed to) a common scale, because the raw scores 
in the different tests may be scored in noncomparable units. In some
studies, the different variables may be in such diverse metrics as inches,
pounds, items correct, ratings, elapsed time, decibels, and so on. We need a
common scaling that informs us directly of individual differences in the
attribute. The scale generally accepted is the standard score, written


$$z = \frac{X - \bar{X}}{\sigma_X} \qquad (2.1)$$


where X is the raw observed score of an individual on a given attribute,
$\bar{X}$ is the mean score on it, and $\sigma_X$ is the standard deviation. For example,
the first child in the suburban group has a score on the sentence completion
test of 17 items correct. The mean and standard deviation of all the
children's scores on this test are 19 and 5, respectively, whence the first
child's standard score is


$$z = \frac{17 - 19}{5} = -.40$$


The standard score has a universal meaning: it tells where an indi- 
vidual stands in his group relative to the standard deviation of all the scores. 
Note that for an individual with a score equal to the mean, i.e., whose X
is the mean $\bar{X}$, his standard score is zero, and for an individual at 1 standard
deviation below the mean, i.e., whose $X = \bar{X} - 1\sigma_X$, his standard
score is −1.00. One knows from these reference points that the first child
with z = −.40 is below the mean by nearly 1/2 a standard deviation.
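
The reference points just described can be checked with a few lines of arithmetic. The function name is illustrative; the numbers are those of the sentence completion example in the text.

```python
def standard_score(raw, mean, sd):
    """Eq. (2.1): distance of a raw score from the group mean, in SD units."""
    return (raw - mean) / sd

z_first_child = standard_score(17, 19, 5)   # the first suburban child
z_at_mean = standard_score(19, 19, 5)       # a child scoring exactly at the mean
z_one_sd_below = standard_score(14, 19, 5)  # a child one standard deviation below
print(z_first_child, z_at_mean, z_one_sd_below)  # prints -0.4 0.0 -1.0
```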

With the raw scores of all individuals on the 24 tests converted to
standard scores on each test, we are in a position to form any general 
composites we wish of the 24 scores merely by deciding what shall be the 
defining variables of each composite and then for each individual adding 
up his standard scores on those definers. 

Suppose we wish to reduce the total number of observed test scores 
to a smaller number of rational composites, e.g., the verbal composite
consisting of the standard scores on five verbal tests, $z_1, z_2, \ldots, z_5$.
The raw score on this composite would be


$$Y = z_1 + z_2 + z_3 + z_4 + z_5 \qquad (2.2)$$


This raw composite score may not be comparable to other composite 
scores, e.g., a raw composite score on four different (say vocabulary) 
tests, since the verbal composite has five scores added together whereas 
the vocabulary has only four, and the relationships among the scores in 
the composites may be different for the two composites. We convert the
raw composite score in expression (2.2) to a standard score (calling it z
for now)


$$z = \frac{Y - \bar{Y}}{\sigma_Y} \qquad (2.3)$$

In this form, the standard scores on all composites are comparable, each
having a mean of .00 and a standard deviation of 1.00. We could therefore
tell a child's relative standing in all the rational composites in the Holzinger
study by his standard scores.

There is a final convention. To avoid carrying plus and minus signs
on a subject's score we convert the standard score to a new scale score
that has a mean of 50 and a standard deviation of 10, thus,


$$Z = 10z + 50 \qquad (2.4)$$
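
The chain from raw test scores to a composite Z score, Eqs. (2.2) to (2.4), can be sketched as follows. The function names and raw scores are hypothetical, and the standard deviation is assumed to be computed with N (not N − 1) in the denominator, a detail the text does not settle.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert a list of scores to standard scores with mean 0 and SD 1."""
    m, s = mean(values), pstdev(values)  # assumption: population SD (divide by N)
    return [(v - m) / s for v in values]

def composite_Z(tests):
    """tests: one list of raw scores per defining test, children in the same order."""
    z_by_test = [standardize(t) for t in tests]
    Y = [sum(zs) for zs in zip(*z_by_test)]  # Eq. (2.2): add the standard scores
    z = standardize(Y)                       # Eq. (2.3): restandardize the composite
    return [10 * v + 50 for v in z]          # Eq. (2.4): rescale to mean 50, SD 10

# Hypothetical raw scores of four children on two defining tests.
tests = [[12, 15, 9, 20], [30, 34, 28, 40]]
Z = composite_Z(tests)
print([round(v, 1) for v in Z])
```

Whatever the number of defining tests or their raw units, the resulting Z scores have a mean of 50 and a standard deviation of 10, which is what makes composites comparable.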


This standard score Z is the final form in which an object's standing 
in any given composite is described in cluster and factor analysis, whether 
the composites are rational groupings of variables or empirical cluster (or 
factor) scores. When we plot profiles of individuals in O-analysis, these Z 
scores are used. It is clear from Eqs. (2.1), (2.3), and (2.4) that standard 
scores, whether in z or Z form, are linear transformations of the raw scores
and therefore take exactly the same form of distribution as the raw scores. 
Thus, if an investigator wishes to make a nonlinear transformation of the 
raw scores to another metric, such as to logarithms or normalized scores, 
the conversion of the raw scores is made before performing the z or Z 
transformation.


Describing the generality of individual differences in an attribute by its correlations


Generality and correlation


Variation among individuals in scores on a variable is said to be general if 
the rankings of objects on that variable correspond to a noticeable degree 
with their rankings on other variables. The traditional index of similarity 
of a variable with another variable is the well-known correlation coefficient 
r, which takes the value of +1.00 if the correspondence is perfect, .00 if
there is no generality, and −1.00 if perfect but inverse (high scores in one
attribute matching lows in the other). For example, in the Holzinger sample 
of 145 children the generality of the sentence completion variable is high- 
est with the other four verbal tests, its correlations with them being .65, 
.73, .63, and .68. However, variation among children in sentence comple- 
tion does not correspond as well with mathematics ability as indicated by 
the number completion test since the correlation between these two 
abilities is only .41. Perhaps if we formed a single composite score from 
scores on all five verbal tests and correlated that score with an analogous
composite of all five mathematics tests, the variation among children in 
these two broadened composites would correlate more, i.e., be more 
general. This is true: the correlation between scores on the verbal com- 
posite and on the mathematics composite is .65. 

What would be the correlation between the V, for verbal, and N, 
for numerical or mathematical, composites if we could increase the number 
of tests in each composite to a very large number covering the whole 
domain of children's verbal abilities and the entire domain of their mathe- 
matical abilities? This correlation between the verbal and mathematical 
domains (or factors) can be estimated to a close approximation by methods 
to be discussed later. The correlation between the "extended" composites
is .76. This ceiling means that there would be considerable generality of 
individual difference across the verbal and mathematical abilities if we 
could secure extensive domain scores on these two abilities, but since the 
correlation would not be 1.00, it also means that the generality would not 
be perfect. Children equal in a domain score on verbal ability could still 
vary quite a bit in mathematical ability and vice versa. 

One difficulty with expressing a relationship as a single index number, 
like the correlation coefficient, is that it really does not communicate fully 
the complex meaning of the relationship between two variables. A graphical
display tells the story much better. To illustrate, Fig. 2.1 is the "scattergram"
of the 145 scores of the suburban children on the composite $Z_V$
score on the five verbal tests, and the composite $Z_N$ of the five mathematics
test scores. This scatter is a photographic reproduction of the grid produced
by the BC TRY component RSCAT. The horizontal scale is the $Z_V$
score axis, the vertical the $Z_N$, and the circled entries are the number of
children at each intersecting score point. This RSCAT scattergram is
especially designed to show the standard score sectors in which individuals
fall. Note that the vertical and horizontal lines of dots represent Z scores
of 50 (at the means) and that vertical and horizontal lines consisting of
dashes lie at $\pm 1\sigma$ and $\pm 2\sigma$.

FIGURE 2.1
Correlation scattergram and prediction tables of V (verbal) ability and N (mathematical)
ability for 145 suburban children in the Holzinger study.


Correlation and prediction 


Study of this scattergram, which corresponds to a correlation of .65, reveals 
that it is fan-shaped, with the wide part of the fan at the lower left. This 
structure means that one ability can be predicted by the other with greater
accuracy in the high ranges of ability than in the middle and low ranges;
i.e., very high scores on either ability represent a more general ability than
lower scores do. The single index number of r = .65 is a gross summary
statement of this relationship but tells us nothing about the fan. 

A computer program like RSCAT provides both the graphical and 
the metric facts of correlation and prediction. From study of this output, 
the meaning of correlation in terms of prediction can be quickly grasped. 
Above the scattergram to the left can be seen a table headed "Prediction
of Y from X." X stands for the $Z_V$ score on the verbal composite, Y for $Z_N$
on the mathematical composite. The program cuts the verbal score X
scale into 12 categories or slices and predicts the mathematics Y scores
for the children in each slice. For example, for the three children in the
top slice, number 12, whose mean X (verbal) score is 76.22 (2.6 sigmas
above the mean verbal score of 50), the linear regression predicted Y
(mathematics) score is 67.07 (1.7 sigmas above the mathematics mean of
50). Note that the predicted mathematics score of the three children at
the extreme has "regressed" toward the mean of 50 compared with their
extreme predictor verbal score, i.e., to 67 from 76. A line through the
linear predicted Y scores is a "line of regression" of Y on the X scores.
The constants of the equation of this line are printed below the prediction 
table. The slope B is .65, which, it will be recalled, is the correlation 
coefficient. 
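
The arithmetic of this regression toward the mean can be sketched in a few lines of Python. This is an illustration only, not the RSCAT routine: the function name is hypothetical, and the standard deviation of 10 is an assumption inferred from the statement that 76.22 lies 2.6 sigmas above a mean of 50.

```python
def predict_standard_score(x, r, mean=50.0, sd=10.0):
    """Linear prediction of Y from X when both are on the same standard scale.

    Assumes both composites have the given mean and sd (50 and 10 here),
    so that the slope of the regression line equals the correlation r.
    """
    z_x = (x - mean) / sd      # distance of X from its mean, in sigma units
    z_y = r * z_x              # predicted distance of Y from its mean shrinks by r
    return mean + sd * z_y


# Top slice of the example: mean verbal score 76.22, r = .65
predicted = predict_standard_score(76.22, r=0.65)
print(round(predicted, 2))
```

With r rounded to .65 the sketch gives about 67.04, close to the 67.07 printed in the prediction table; the small gap presumably reflects rounding of the printed slope.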

This slope gives us one meaning of r, namely, that when the subject's
X and Y scores are converted to the same scale (z or Z), the best-fitting
straight line through the predicted Y scores from X has a slope r. When
the swarm of person points in RSCAT all lies on this regression line, one
can predict Y perfectly from X, whence the slope of the line is 1.00. But
when the swarm lies randomly over the whole scatter, then whatever the
value of an X score, its predicted Y is the mean Y; hence the regression
line runs along the horizontal dotted mean Y line, which has a slope of
r = .00. (The other type of prediction is of X from Y: its prediction table
is upper right in Fig. 2.1.)

If the relationship of Y to X is not linear, then a smoothed line drawn
through a plotting of the column of values headed "Predicted Y, Curv." will
reveal the fact. The existence of curvilinear prediction is also summarily
revealed by the correlation ratio η printed at the upper right. When η is
significantly greater than r, the relation is curvilinear. Since there are two
predictions, Y from X and X from Y, there are two values of η. In the
illustration, both η's are only slightly higher than r, and so we know that
the relation between these two abilities is essentially linear. The "unbiased"
η is a corrected value of η (called "Tryon's eta"), which compensates for a
tendency for η to be biased upward when the number of cases is small.
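
For readers who want to see the correlation ratio computed, the following Python sketch uses the standard textbook definition: the square root of the between-slice sum of squares divided by the total sum of squares. The function name and data are illustrative assumptions. The small-sample correction behind the "unbiased" value is not reproduced here, since its formula is not given in this chapter.

```python
from statistics import mean


def correlation_ratio(slices):
    """Uncorrected correlation ratio of Y on X.

    `slices` is a list of lists: the Y scores of the cases falling in each
    X slice. Empty slices are ignored.
    """
    all_y = [y for ys in slices for y in ys]
    grand_mean = mean(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    # Between-slice variation: each slice mean weighted by its number of cases.
    ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in slices if ys)
    return (ss_between / ss_total) ** 0.5


# Made-up scores in three X slices
eta = correlation_ratio([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(eta, 4))
```

Comparing this value with r for the same data is the check described above: a correlation ratio well above r points to a curvilinear relation.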

The fan-shaped plot, showing that prediction is better in the higher 
reaches of the two abilities than in the lower, is metrically expressed in the 
"error of prediction," shown in the column "SE YX," meaning "standard
error of predicting Y from X." This value is the standard deviation of the
Y values in the X slices, corrected for small sampling in the Unbiased
column but uncorrected in the Biased column. The error of predicting the
mathematics score is greater when the verbal scores are smaller than
when larger, which is the same as saying that the plot is fan-shaped. This
difference in the error of prediction at different levels (slices) of X is
technically referred to as "heteroscedasticity." If the errors in each slice
were the same, the relation would be "homoscedastic."
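
A slice-by-slice error of prediction of this kind can be sketched as follows. The code is a hypothetical illustration, not the RSCAT computation: the slice boundaries and scores are made up, and the "corrected" value is assumed here to be the ordinary N - 1 standard deviation, which may differ from the correction the program applies.

```python
from bisect import bisect_right
from statistics import pstdev, stdev


def slice_errors(pairs, edges):
    """Return (n, uncorrected SD, N-1 corrected SD) of Y for each X slice.

    `pairs` holds (x, y) scores; `edges` holds the ascending X boundaries
    between slices, so len(edges) + 1 slices are produced.
    """
    slices = [[] for _ in range(len(edges) + 1)]
    for x, y in pairs:
        slices[bisect_right(edges, x)].append(y)
    report = []
    for ys in slices:
        if len(ys) < 2:
            report.append((len(ys), None, None))  # no spread can be estimated
        else:
            report.append((len(ys), pstdev(ys), stdev(ys)))
    return report


# Made-up data: wide spread of Y at low X, narrow spread at high X (a "fan")
data = [(40, 30), (42, 50), (45, 70), (60, 58), (62, 60), (65, 62)]
for n, biased, unbiased in slice_errors(data, edges=[50]):
    print(n, round(biased, 2), round(unbiased, 2))
```

Unequal values down the list are the numerical signature of heteroscedasticity; roughly equal values would indicate a homoscedastic relation.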

Generally it is wise always to look at the scattergrams between com- 
posite scores derived by cluster or factor analysis. They show in greater 
detail the nature of the relationships between the composites than the 
correlation coefficients taken alone do. 

Not only are such vitally important matters as curvilinearity and 
heteroscedasticity revealed in the scattergrams, but simpler statistical 
features of individual differences in each of the composites are also 
revealed. Thus in the RSCAT presentation of Fig. 2.1, the histograms of 
the verbal and mathematics score can be drawn from the frequencies given 
in the extreme right columns, headed N, of the prediction tables (and in 
more detail at the top and right margins of the Scattergram). Also, the 
percentile ranks corresponding to every .200 score are printed at the ex- 
treme left and bottom margins. 


Discovering and describing the degree of similarity of variables and of objects


In V-analysis we form general composites of variables that are similar; in 
O-analysis we form general types composed of individual objects that are
similar. We do so by observing the objects in their relations to a common 
body of other referent entities. To the degree to which the patterns of 
their relations to these referents are the same we judge them to be similar. 


Similarity of variables in ordering individuals 
Take, first, the situation in which this format appears in the V-analysis of 
the Holzinger problem, namely, discovering and describing the degree of 
similarity of the 24 separate abilities. The schematized score matrix in 
Fig. 2.2, in which dashes represent observed scores, illustrates the data 
from which the rest of the analysis proceeds. In the schema of Fig. 2.2 the 
columns are the 24 tests set up as comparison entities. The rows are the 
145 children taken as common referent entities. The observations in the 
matrix are raw test scores. The degree of similarity of any two column vari- 
ables, for example, V7 (sentence completion) and N23 (number comple- 
tion), is the degree to which the patterns of children's scores in these two 
columns are similar, or correspond. The index calculated is the correlation 
coefficient, which is therefore the "index of similarity in ordering individuals."
It turns out to be .41 for these two variables in the Holzinger sample
of 145 children.
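
As a concrete illustration of this index, the following Python sketch computes the product-moment correlation between two columns of a score matrix. The function name and the scores are made up for the example; they are not the Holzinger data.

```python
from math import sqrt


def pearson_r(col_a, col_b):
    """Product-moment correlation between two columns of scores.

    The two columns must list the same individuals in the same order.
    """
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    ss_a = sum((a - mean_a) ** 2 for a in col_a)
    ss_b = sum((b - mean_b) ** 2 for b in col_b)
    return cross / sqrt(ss_a * ss_b)


# Five hypothetical children scored on two tests
test_1 = [12, 15, 9, 20, 14]
test_2 = [30, 31, 25, 38, 27]
print(round(pearson_r(test_1, test_2), 2))
```

The closer the two columns come to ordering the children in the same way, the closer the index comes to 1.00.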


Similarity of variables in sampling domains 


The indexes of correlation among all variables are set up in a paired- 
comparison matrix, schematized in Fig. 2.3. 


Paired-comparison matrix, the matrix of correlations of
comparison entities. Observations are the raw scores on
each of the column variables.


We can ask the similarity question about any two column variables in
this matrix: What is the similarity of V7 and N23 in relation to all the variables
as a common body of row referent entities? Since the referent entities
are now variables and not objects, we seek an index of the similarity of V7 
and N23 in the way the two variables relate to, or sample, the 24 test 
domains from which the array of test samples is drawn. The similarity is a 
function of the degree to which the columns of correlations of the two tests 
follow the same pattern. This is measured by an "index of proportionality" 
(see Chap. 4), also variously called "the index of similarity of domain 
sampling," “the interdomain correlation,” and "the common factor corre- 
lation." This similarity is expressed graphically as the spatial separation 
between variables when they are plotted as points in a geometric space, 
e.g., a sphere. 

The important matter here is not the details but the general design 
by which the similarity of any two entities is determined. The entities are 
expressed as two comparison entities in their relations to a specified body 
of common referent entities. The index of their similarity with respect to 
the referents is the degree to which they have the same pattern of obser- 
vations on the referents. In the case of the score matrix, the similarity 
index is their correlation. In the case of the correlation matrix it is the
index of proportionality.


Similarity of dimensions in the same and in different groups


The same design is used in determining the degree of similarity of any two 
dimensions defined by composites of the variables, like the verbal and 
mathematical composites. In this case the column comparison entities 
are the composites, the row referent entities are the 24 variables, and the 
observations are the columns of correlations of the 24 variables with the 
composites. The measure of similarity of any two dimensions is therefore 
their index of similarity in domain sampling across the 24 tests. The dimen- 
sions can be those determined on different groups. As long as the row 
referents of 24 variables are the same in the different groups, we can
determine the similarity of dimensions expressed as column comparison 
entities. 


Similarity of objects in their profile patterns 


The same matrix formulation provides the basis in O-analysis of determining
the similarity of objects, such as the 145 suburban children. The data
for N separate children are expressed as N column comparison entities.
In the Holzinger example the four cluster composite scores verbal (V), 
speed (S), form (F), and memory (M) are expressed as row referent vari- 
ables. The observations in the matrix are the standard Z scores on the 
four composites. The index of similarity of the profile patterns of any two 
children is the similarity in their two columns of Z scores. The actual index 
computed is D, the euclidean distance between the two children repre- 
sented as two points in the four-dimensional score space of V, S, F, and M. 
It is on the basis of the values of D among all the children in this score 
space that the typology of the group is formed. In comparative typological 
analysis the same matrix formulation is employed to discover the degree 
of similarity of typologies of different groups. 
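
The distance index D can be sketched directly. In the Python below the function names and the three profiles are hypothetical; only the definition of D as the euclidean distance between two profiles of standard scores comes from the text.

```python
from math import sqrt


def profile_distance(p, q):
    """Euclidean distance D between two profiles of standard scores."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(profiles):
    """Square matrix of D between every pair of individuals."""
    return [[profile_distance(p, q) for q in profiles] for p in profiles]


# Three hypothetical children scored on V, S, F, and M (mean 50)
children = [
    (50, 50, 50, 50),
    (53, 54, 50, 50),
    (70, 30, 50, 50),
]
for row in distance_matrix(children):
    print([round(d, 1) for d in row])
```

Children whose profiles nearly coincide have a small D and are candidates for the same type; the third child here lies far from the other two.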

This matrix design has quite universal application and lies at the root of 
V-analysis and O-analysis. With a little extrapolation it is the broad design 
by which we assess the degree of similarity and difference between the 
entities that compose our phenomenal world. 


Spherical representation of the relations among variables and objects 


From high school geometry most readers are already familiar with the 
cartesian coordinate system of plotting entities on a set of orthogonal axes. 
In cluster and factor analysis that method is inefficient as a means of 
depicting the relations among variables or objects because it does not 
utilize directly the special knowledge about the relations among entities 
gained from the processes of factoring and it is of limited practical use 
when the number of dimensions is three or more. 

In the V-analysis of the 24 Holzinger abilities, for example, we find, 
by factoring the correlation matrix, that the relations among the abilities 
can be adequately described in four dimensions. The factoring process 
results in coordinates on four dimensions for the 24 tests. If we restrict 
our attention to three-dimensional subspaces, we can build up an intuitive 
understanding of the larger spaces. Since “how to do it" details are given 
in Chap. 7, we skip them here and ask the reader merely to look at the
results. Looking forward to Figs. 7.3 and 7.4, it can be seen how the con- 
figured relations among the 24 abilities become revealed as a display of 
points on spheres (three-dimensional). Since, by factoring, the successive 
dimensions involve progressively less test variance and at a different rate 
for different variables, it is possible to depict the total configuration on 
subsets of spheres in such a fashion that the first sphere depicts a maxi- 
mal part of the configuration, the next one maximally picks up an additional
part of the configuration, and so on. Since "marker" variables carry
over from one sphere to another, it usually happens that, however many 
dimensions there may be, the salient features of the configuration can be 
observed on only several spheres. 

When we come to O-analysis, the problem of depicting the relations 
among all the individual objects in a spherical configuration is more diffi- 
cult. In V-analysis the spatial surface distance between any two variables 
on a sphere is a function of the correlation between them. 

In O-analysis it is not immediately evident that for any two objects 
in score space (say any two children plotted as points in Fig. 2.1) the 
euclidean distance between them is monotonically related to the spatial 
surface distance between them as points in a spherical configuration. 
That this relationship does, in fact, hold can be demonstrated mathemati- 
cally and operationally. The BC TRY procedures of euclidean analysis 
(EUCO) map objects into score space perfectly. Imagine that the two scores
Z1 and Z2 on 45 subjects resulted in a subject space like that in the upper
part of Fig. 2.4. The two scores on each subject are input to EUCO, which
computes a 45 by 45 matrix of intersubject distances in the subject space. 
Another program transforms this distance matrix to a correlation matrix, 
which is subjected to a full key-cluster analysis (as described in Chaps. 6 
and 7), finally revealing the configuration (in program SPAN) on the surface 
of a sphere, as shown in the lower part of Fig. 2.4. Except for a systematic 
distortion in which the points around the origin of the two axes Z1 and Z2
are spread out more than at the edges, the configuration given in cartesian
score space is preserved in the spherical representation.


THE BC TRY COMPUTER PROGRAM SYSTEM


When the methods outlined in the first two chapters are applied to
large numbers of variables or objects (greater than seven or eight
perhaps), the advantage of having a computer to do the actual analysis is
clear. The very large number of comparisons and calculations implied in
V-analysis and O-analysis is a discouraging prospect without the assistance
of a modern digital computer.

Although a current trend toward acceptance of the computer by
behavioral scientists is clear, it has been slow in developing. Part of the
resistance stems from the myth that the "electronic brain" somehow
intrudes into the domain of the scientist in a way harmful to the understanding
of data or the freedom of the investigator. The main resistance
to learning about how to use computers stems from an equally irrational
belief that computers can be used only if the user personally knows how
to "program" the computer or at least knows the meaning of a host of
fanciful terms and jargon.

These irrational positions are, of course, quite irrelevant. It takes
nothing special in intelligence or knowledge to become quite expert in
applying computer programs. Learning to use a computer program in an
application to data takes less time and effort than learning how to operate
a desk calculator. As a user of a computer program one need know nothing 
about computer programming proper. Anyone who wishes can leave the 
hardware to the engineers and the programming to programmers. 

In this day and age, however, it is not difficult to find ways and means 
of learning how to program a computer. Several books aimed at teaching 
behavioral scientists the arts and methods of computer programming are 
available (e.g., Lehman and Bailey, 1968). Since one of us (DEB) has been 
teaching computer programming to psychology students (undergraduate 
as well as graduate) for several years, we now know that in approximately 
32 hours of instructional time ordinary students can become reasonably 
adept at computer programming. Consequently, the fact that one need not 
know programming to make expert use of computer programs, such as 
the BC TRY System of cluster and factor analysis, should not inhibit anyone
from learning how to program the computer if he finds it useful or interesting
to do so.


On computer use and computer programming 


To use a computer program with expertise requires only knowledge of a 
few simple facts. The program must perform the kind of analysis that is 
wanted on data like the data it is intended to analyze. How to punch the 
data into computer cards must be known, along with the control data the 
program requires to proceed with the analysis, e.g., the parameters 
describing features of the data such as their title, the number of variables, 
and the number of observations. If the program performs alternative kinds 
of analyses at some points in its work, control data may have to be provided 
(again, on punched cards prepared by the user) informing the program 
of the choices the user wishes to make. In many programs such choices 
can be left to the judgment of the programmer (expressed as part of the
program) by opting to have the program follow standard alternatives
(optimal alternatives, or so one might hope). Where the alternatives chosen
(either the user's or the programmer's) turn out not to be good choices,
a reanalysis of the data by the program often is a matter of repunching a
control card (a reanalysis in the days of paper-and-pencil programs would 
have been a serious matter). The main thing a user must know is the kind 
of analysis a program can perform and how to use the program, i.e., how 
to punch the control cards. 

Although the developmental stages of the programming process are 
of secondary interest to the program user, it is useful to have a general 
understanding of the stages by which the analytical tools are provided.
Table 3.1 schematizes the six major stages in the development and use 
of a program when there are a number of users. 

The first stage in developing a program is the conceptual formulation 
of an objective that needs to be achieved. The conceptualizer must specify, 
in a step-by-step logical fashion, the successive steps to attain the objec- 
tive. For example, the objective may be to compute the means of n 
variables. Almost anyone can lay out a step-by-step method by which this 
objective might be accomplished. The objective may be more complex, 
e.g., encompassing a hierarchy of sub-objectives such as "(1) to compute
the means and standard deviations of n variables, (2) to read all the scores
of all N individuals on the n variables as they are punched on cards, (3) to
record these scores on a magnetic tape to be called the Data Storage Tape
(DST) so that programs used later can have access to them, and (4) to
record the means and standard deviations on another tape, to be called
an Intermediate Storage Tape (IST), for a similar use to that of the DST."
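
The four sub-objectives just listed can be laid out step by step in a modern language. The Python sketch below is illustrative only and is not the program discussed in this chapter: the function name is hypothetical, two dictionaries stand in for the DST and IST tapes, and dividing by N rather than N - 1 in the standard deviation is an assumption.

```python
from math import sqrt


def process_scores(rows):
    """Store the scores and compute means and standard deviations.

    `rows` holds one list of n scores for each of N individuals.
    Returns two dictionaries standing in for the DST and the IST.
    """
    n_individuals = len(rows)
    n_variables = len(rows[0])
    means = []
    sds = []
    for j in range(n_variables):
        column = [row[j] for row in rows]           # all scores on variable j
        m = sum(column) / n_individuals             # sum of X divided by N
        var = sum((x - m) ** 2 for x in column) / n_individuals
        means.append(m)
        sds.append(sqrt(var))
    dst = {"scores": [list(row) for row in rows]}   # raw scores for later programs
    ist = {"means": means, "standard_deviations": sds}
    return dst, ist


dst, ist = process_scores([[1, 10], [3, 20], [5, 30]])
print(ist["means"])
```

Each sub-objective appears as a visible step: reading the scores, storing them, computing the constants, and storing the constants separately for later use.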

The hierarchy of objectives is really an extension of the simple objective
of finding the means. The solution of each sub-objective can be laid
out in a step-by-step fashion, and so can the sub-objectives themselves in
relation to each other. This complex example is quite real; it represents the
initial formulation stage of an actual program called the DAta Processor
program, abbreviated DAP from the capitalized letters in its full title.


TABLE 3.1 STAGES IN THE DEVELOPMENT AND USE OF A MULTIPLE-USER
COMPUTER PROGRAM

Stages | Formal result | Agent

1. Verbal formulation of the purpose of the program and the step-by-step method of achieving it | Program design (prose algorithm) | Conceptualizer
2. Translating the formulation to a general symbolic language (e.g., Fortran), punched on cards | Source program (program listing) | Programmer
3. Translating the symbolic language on the source deck to the specific machine language accepted by the computer | Object program | Compiler (a program)
4. Storing the object deck for ready application by the computer to a particular data set | Entry into program library on tape or disk (system tape) | Computer center depository
5. Printed instructions on the use of the program, giving purpose, method, required control cards, restrictions | User's description | Programmer and conceptualizer
6. Calling the program by required control cards for the execution of it by the computer on a given data set | Computer run | User


When the step-by-step method of reaching a goal can be laid out 
conceptually, each element in each step can be expressed by a symbol. 
For example, from elementary statistics, "to find the mean of the X scores"
is conventionally symbolized as ΣX/N, where Σ means "add up," X
means "the X scores," / means "and divide by," and N is the number of
scores. The computer is, in simplest terms, a machine expressly designed
to manipulate sequentially the symbols defining the successive stages that 
reach a desired goal. The symbols can refer to numbers, but they can also 
stand for other entities, and the manipulations of them, in addition to the 
usual arithmetic algebraic kinds, can be those of retrieving, ordering, trans- 
ferring, sorting, storing, rejecting, and many others. Since so many of 
man's mental and artistic achievements and emotional displays can be 
symbolically represented, it is clear why computer use has rapidly moved 
into nearly all fields of man's activities. 

A step-by-step procedure of achieving an objective is known as an 
“algorithm.” The job of the conceptualizer of a computer program breaks 
down, therefore, into specifically framing the algorithms of his intended 
program into a program design, also known as a prose (or verbal) algorithm. 

Stage 2 in the development of a program involves the specialist known
as a programmer. He translates the prose algorithm of the conceptualizer 
into a family of symbols, e.g., Fortran, that, with additional processing (in 
stage 3 of Table 3.1), can be manipulated in the desired fashion by the 
computer. As the result of interaction between conceptualizer and pro- 
grammer in stages 1 and 2, the final program design may take a radically 
different shape from that originally proposed by the conceptualizer. Also, 
in the welter of symbols that embody a program, mistakes, or "bugs," can
occur. The last operation of a programmer in the second stage is to 
“debug” the program. 

The remaining stages of development are perhaps apparent from 
Table 3.1. Because the symbols written down and punched on cards (the
source program) by the programmer are usually not those actually manipulated
by the computer itself, they need to be translated into the machine
language that the particular computer can manipulate. This translation 
is done by the computer itself, using a program called a “compiler.” The 
result of this is expressed as an "object program," which may be punched 
on cards by the computer or recorded into the “library” of the computer. 
When the program is compiled, the next step, stage 4, is to store it in a 
program library (on tape or disk) from which, on call, it can be read for the
purpose of executing the algorithm on a particular user's data set. 

If the user's description of stage 5, prepared jointly by conceptualizer 
and programmer, is clearly written, the user is in fact primarily (perhaps 
exclusively) interested in this description and in his own computer run
in stage 6.


The BC TRY System of programs 


The BC TRY System consists of about 30 different programs, some com- 
bination of which, in a particular sequence, executes an aspect of cluster 
and factor analysis on a particular data set. To use the System, all that the 
user needs to know is the particular sequence of programs to use and the 
control cards necessary for each program. For example, if he wants to 
have the scattergram between two variables computed, he need only know 
the sequence of programs necessary to get that scattergram. Looking in 
the User's Manual, he discovers that the scattergram program RSCAT 
must be preceded by program DAP but that DAP itself requires no other 
program preceding it. The system cards for DAP are preceded by a card 
punched START, which prepares the BC TRY System, located on a disk 
(or tape), for action. The computer determines that the BC TRY System is 
the one system among several available that it should select. When the
user's job is activated, the computer is under the overall control of a
superprogram, called a "monitor," and the user has indicated on his first
cards (monitor cards) that the monitor should call the BC TRY System. The
sequence needed to compute a scattergram is


Monitor cards
START card
DAP card (followed by its control cards and data deck)
RSCAT card (followed by its control card)
END card


More briefly, we write this sequence as


JOB—START—DAP—RSCAT—END
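
The rule that a component may be called only after the programs it depends on can be sketched as a small check. The Python below is purely illustrative and is not part of BC TRY: the table of prerequisites contains only the one dependency stated above (RSCAT must be preceded by DAP), and the function name is hypothetical.

```python
# Only the dependency stated in the text is recorded here; a real User's
# Manual would list the requirements of every component.
PREREQUISITES = {"DAP": [], "RSCAT": ["DAP"]}


def build_job(components):
    """Return the full card sequence, or raise if a prerequisite is missing."""
    placed = []
    for name in components:
        missing = [p for p in PREREQUISITES.get(name, []) if p not in placed]
        if missing:
            raise ValueError(f"{name} must be preceded by {', '.join(missing)}")
        placed.append(name)
    return ["JOB", "START", *placed, "END"]


print(build_job(["DAP", "RSCAT"]))
```

Calling `build_job(["RSCAT"])` raises an error, which mirrors what the user learns from the manual: the scattergram program cannot run until DAP has prepared the data.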


BC TRY is an integrated system of programs designed on a general 
conceptual scheme to execute the grand algorithm by which a user dis- 
covers the general attributes and general types of individuals in a score 
matrix (Tryon, 1958b, 1959). A particular computer run accomplishes one 
of the sub-objectives of the overall objective, executing a special-case solu-
tion of the general logic of V-analysis and O-analysis. In a general way the 
many methods of factor analysis and cluster analysis have the common 
objective of discovering a minimal number of composites among a collec- 
tion of variables or objects. The composites can replace the full array of 
variables or objects without loss of generality, in the sense that the reduced 
set will reproduce the intercorrelations among the full array. 

The BC TRY System provides the user with many options on how to 
solve for a reduced set of dimensions and objects. For the user who is 
overwhelmed by the embarrassment of riches, the System offers standard 
options that have been found in many problems to give satisfactory, 
perhaps optimal, solutions. 

Though all the main forms of factor analysis are available in BC TRY 
to investigators who have predilections for one rather than another, the 
main emphasis is on the methods of key-cluster analysis based on the 
domain-sampling formulation of cluster and factor analysis (Tryon, 1959
and 1958b). The emphasized methods will be denoted simply as "cluster
analysis."

As a preliminary to the procedures of cluster analysis, the data are 
first input to the computer in proper form. The initial step of V-analysis is 
to compute the correlations among all the variables. The proportion of the 
variances to be reproduced is next set in the diagonal cells of the r matrix 
(the inserted values are usually estimates of their communalities, though 
one can insert reliability coefficients or unities). Next, the dimensionality 
of the matrix is estimated by factoring on the most collinear subsets of 
variables. Sufficiency of the dimensions is then tested by computing 
residuals. Finally, an oblique cluster structure solution, called a "direct
oblique rotation" in orthodox factor analysis, is described both in statistical
and in geometric terms. The investigator may then proceed to an
O-analysis of his data, first scoring individuals on the several clusters and
then by a series of five integrated steps isolating general types of individuals.
If he wishes to test the predictability of other "outside" attributes
from these O-types, components for doing so are available.

BC TRY includes many other important supplemental procedures. 
These are components to perform comparative V-analysis ("matching
factors") as well as comparative O-analysis; the latter enables one to assess
the forces of multivariate selection in different groups measured on the 
cluster defined dimensions. Special components permit V-analysis and 
O-analysis however many variables or subjects one has. For users who 
wish to perform the traditional kinds of factor analysis, including rotation, 
these are included in the System, and since these methods are programmed 
as integral components of the System, one can compare the results of these
analyses with those from key-cluster solutions. 

The computing system attains its various objectives through the con- 
trol of component programs by a General Executive Program (GEP). For 
example, when the System embarks on V-analysis, GEP at the top hierarchical
level initiates this analysis by passing control to a component that
processes and stores the data. Control then returns to GEP, which in turn 
passes it to a second component program to compute a paired-comparison 
(correlation) matrix. After completion of this computation, control again 
passes back to the executive, which then proceeds to pass control successively
to further component programs, each of which embodies and
completes a logical step of V-analysis.

The System is somewhat more complex than stated above. Some of 
the components themselves are subhierarchies. Further, the investigator is 
permitted to choose the form of analysis he wishes the machine to undertake.

Any type of cluster or factor analysis performed by the BC TRY Sys- 
tem is executed by separate components linked together in tandem. Each 
program itself is named by an alphameric (alphabetic and/or numerical) 
label like GEP, DAP, COR2, and so on. Identified only by these symbols, a 
sequence may at first seem to be unintelligible to the person learning 
about BC TRY. However, all parts fit into the general logic by which one 
reduces many specific attributes of individuals to a smaller number of 
general attributes, i.e., V-analysis, and in terms of which many particular 
individuals are reduced to a small number of general types, i.e., O-analysis. 
Presently the user must communicate his wishes by control cards both to 
the executive and to the components. In time it is hoped to simplify this 
interaction by making it necessary for the user to communicate only with 
the executive, which alone will manage the components without further 
intervention by the user, who would thus be rid of the onerous task of 
preparing component control cards. 

An illustration of how BC TRY executes the logical analytical procedures
in a full-cycle cluster analysis of variables and of individuals will
provide a broader acquaintance with some of the components that con- 
stitute the System. The successive stages of such an analysis are listed 
in Table 3.2. 

Some procedures are “compounds,” consisting of linked programs. 
There are two stages labeled Stop (steps 6 and 10) where the investigator 
will normally, though not necessarily, intervene and size up what the Sys- 
tem has produced and, if indicated, direct the System to rework some steps. 

Early components in а sequence produce results that are utilized 
by later components. These events are achieved through the mediation 
TABLE 3.2 STANDARD CLUSTER ANALYSIS OF VARIABLES AND OF INDIVIDUALS


Stages of analysis                                               Component


A. Input of raw data

 1. Preparing the n variables on N individuals for processing
    by later components
    a. Setting the data on the Data Storage Tape (DST)           DAP
    b. When n or N are very large, choosing random or forced
       samples                                                   PICK


B. V-analysis: cluster analysis of variables

 2. Determining generality of variables: correlation matrix
    (missing data require COR3)                                  COR2
 3. Deciding on variance to be factored                          DVP
 4. Performing dimensionality analysis (factoring)
    a. Selecting maximally collinear dimension-defining
       clusters and computing orthogonal factor coefficients
       on them                                                   CC
    b. Computing residuals to test dimensional sufficiency       CC
 5. Describing the oblique structure of the dimension-defining
    clusters (and of dependent clusters, if any)
    a. By statistical quantities (a 'direct solution')           CSA
    b. By a geometric configuration                              SPAN
 6. Stop. After study of CSA and SPAN, revising the cluster
    selection by redefining the subsets (including hierarchical
    condensation, i.e., higher-order analysis) if necessary and
    repeating steps 4 or 5


C. O-analysis: cluster analysis of individuals (objects)

 7. Scoring individuals on the oblique dimension-defining
    clusters                                                     FACS
 8. Linear and nonlinear relating of cluster scores and
    two-dimensional O-analysis on correlation scattergrams       RSCAT
 9. Selecting core O-types from cluster score space
    a. Calculating euclidean distances between individuals       EUCO
    b. Introducing marker individuals in the configuration       OMARK
    c. Selecting core O-types by dimensionality and structure
       analysis                                                  NC analysis
10. Stop. Revision of core O-analysis, as in stage 6
11. Determining O-clusters by identifying each individual with
    a core O-type, however large N is                            OTYPE
12. Describing the cluster score pattern and homogeneity of
    resulting O-clusters                                         OSTAT
13. Nonlinear predicting of 'outside' attributes from the
    categorical series of O-clusters                             4CAST

of the Intermediate Storage Tape (IST); the components write their own
unique results on IST, and from it they read what they need. Just how this
works in the full sequence of Table 3.2 down through step 7 is illustrated
in Table 3.3.

Under the column headed Component Program, the first component
is a data processor called DAP. DAP sets up the raw data on the Data
Storage Tape (DST), computes the means (MEANS1) and standard deviations
(STDEV1) of all the n variables, and outputs these constants to the storage
tape, IST, as shown by the two O's opposite DAP in Table 3.3. In the
TABLE 3.3 HOW COMPONENT PROGRAMS PRODUCE AND UTILIZE FILESᵃ


Files on IST (symbolic names):

  MEANS1  Means                  REFLX1  Dimension definers
  STDEV1  Sigmas                 UFACT1  Orthogonal factor coefficients
  CORRM1  Correlations           RFACT1  Oblique factor coefficients
  DIAGV1  Diagonal values        BASIS1  r between clusters


Component
program         MEANS1  STDEV1  CORRM1  DIAGV1  REFLX1  UFACT1  RFACT1  BASIS1

DAP (for DST)     O       O
COR2              I       O       O
DVP                               I       O
CC
CSA
SPAN
FACS              I       I


ᵃ O = output by the component, I = input to it, (I) = optional input to the
component.

[The entries for CC, CSA, and SPAN, and the remaining entries for FACS, are
illegible in this copy.]


MEANS1 column, it is seen that means are later used in calculating
correlations (I opposite COR2) and much later in computing cluster or factor
scores (I opposite FACS). But, as shown in the STDEV1 column, the standard
deviations are not used by the correlation program, COR2, since there
is no I in the COR2 row under STDEV1. COR2 itself computes standard
deviations for the second time (O opposite COR2); these are used much
later by FACS. And so the accumulation of results goes on as the operation
proceeds down the table, until the time comes to compute scores on
the oblique clusters discovered in the V-analysis. FACS leans heavily on
results from almost all the prior components.

Table 3.3 gives only a few examples of the complex manipulation of
files by BC TRY. Other procedures are listed in Table 3.4, which also gives
the symbolic name of the component of BC TRY that does the work.

All the main forms of orthodox factor analysis (Harman, 1967;
Thomson, 1951) can be computed by the BC TRY System, as shown in
Table 3.5. Actually, these types are merely different forms of factoring
(Table 3.2, step 4), which is only one stage in the full-cycle cluster analysis.
Except for square root factoring, these orthodox forms of factor analysis
define dimensions by different patterns of weights on all n variables.
Such dimensions are usually difficult to understand. Indirect, or "derived,"
rotations (Harman, 1967, pt. III) are therefore resorted to, usually by
varimax or quartimax.

The components that execute the procedures of Tables 3.2 to 3.5 are
statistical and logical programs, given the symbolic name of STANALOGS
in the BC TRY System.
Finally, there arises the problem of making the BC TRY System available
to users. Official versions of the BC TRY System are being made available
to computer centers at universities and research facilities. However,
the problem is complicated not only because the system provides a great
wealth of interconnected components but also because the components
are being constantly revised and will be programmed on different
computers. This problem is being met in several ways.

Accompanying the System is a User's Manual written from punched
cards onto magnetic tape and dated to refer to a particular version in a
particular language on a particular machine. The cards and tapes can
easily be changed as versions change. Multiple printout copies can be
made available to local users quickly and cheaply.

To give some idea of the speeds of computation, some time estimates
taken from the detailed time charts in the IBM 7090 User's Manual (with
the system stored on tape) are shown below. The estimate for a 25-variable
standard cluster analysis by steps 1 to 5 given in Table 3.2 for this direct
oblique

TABLE 3.4 SUPPLEMENTAL PROCEDURES OF BC TRY


Procedure                                                        Name


A. Comparative analyses

 1. Comparing cluster-defined dimensions discovered by
    V-analysis of different groups (or of the same group by
    different factoring methods)                                 COMPᵃ
 2. Comparing the O-clusters discovered in different groups,
    i.e., assessing multivariate selection                       OCOMPᵃ


B. Large-scale V-analysis and O-analysis

 3. However large n or N, converging in random subsamples on
    the cluster-defined dimensions in the full domain of n
    variables or on the O-clusters in the full supply of N
    individuals                                                  BIGNVᵃ


C. Miscellaneous

 4. Suppressing designated variables during factoring but
    reactivating them in a summary analysis                      SLEPᵃ
 5. Dimensional and oblique structure analysis without
    communalities                                                NC, NCSA
 6. Dimensional analysis on preset dimension-defining clusters
    including designed and higher-order (hierarchical) analysis  CC, NC
 7. Computing communalities in any one of six different ways
    and utilizing them (or reliability coefficients and
    unities) as diagonal values in the correlation matrix        DVP
 8. Rational nondimensional cluster analysis, e.g., item
    analysis                                                     CSA, NCSA
 9. Multiple correlation and regression of dependent clusters
    on the minimal set of dimension-defining clusters            SMIS
10. Missing data management                                      COR3, FACS3


ᵃ Linked programs.
TABLE 3.5 ORTHODOX FACTOR ANALYSES BY BC TRY


Procedure                                                    Name


A. Simple-sum factoring

 1. Thurstone centroid for n ≤ 20; salient centroid for
    n > 20                                                   CC(CENT)
 2. Bifactor analysis (first dimension a centroid,
    remaining dimensions key cluster)                        CC(CENT+)
 3. Square root or diagonal factoring (or pivot-variable
    analysis)                                                CC(PV)


B. Least-squares total-set factoring and auxiliaries

 4. Principal-component or principal-axes factoring          FALS(PFA)
 5. Canonical factoring                                      FALS(CFA)
 6. Augmented (or α) factoring (also by simple-sum
    factoring)                                               FALS(AFA)
 7. Residuals from least-squares (or any other) factoring    FAST(RESID)
 8. Reproduced correlation matrix from least squares
    (or any other) factoring                                 FAST(REPRODUCED)
 9. Rotation of total-set factors (any type) varimax or
    quartimax                                                GYRO
10. Regression scores of orthogonal or oblique total-set
    factors                                                  FACS
11. Comparison of total-set factors in different groups
    with those of a best-fitting population                  SIMRO


solution is

DAP-COR2-CC-CSA-SPAN: 1.74 min

Compare this time with that required of two orthodox indirect
orthogonal "factor analysis" solutions which the user obtains by simply
replacing the CC-CSA segment by distinctive calls that define these two
solutions. First are shown the times for the cluster analysis segments,
followed by those for the defining segments of a centroid-quartimax
and a principal-axes varimax solution (the GYRO components could be
interchanged):


  Cluster analysis        Centroid-quartimax         Principal-axes varimax
Segment  Time, min      Segment       Time, min      Segment       Time, min

CC          .45         CC(CENT)         .45         FALS(PFA)       1.42
CSA         .37         GYRO(QRTMAX)     .81         FAST(RESID)      .93
                                                     GYRO(VARMAX)    1.31

Total       .82                         1.26                         3.66

The slower principal-axes speed of nearly 4 min is required despite the
use of Householder-Ortega-Wilkinson (HOW) lemmas in the subroutine
that computes eigenvalues and eigenvectors with great speed and accuracy
on matrices up to 120 variables.

Such time comparisons can be quite misleading due to vitiating
"overhead" hardware features. Thus, much of the above slowness of the
principal-axes segment is an accidental result of the location of its
components at the end of the system tape. When corrections for tape
positioning are made (or actually achieved in disk storage), the
principal-axes segment takes only about one-third more time than the
cluster analysis, which is itself much speeded up.

Other practical restrictions are these: input is restricted to 100
variables, more or less, depending on the installation, on no more than
9,999 subjects (5,000 on some installations). Factored dimensions cannot
exceed 15. As machines increase in capacity, these limits can, of course,
be raised. But even now, when one uses the designs of BIGNV (see Table
3.4), there is really no limit on numbers of variables or subjects.


Systematic computing procedures in BC TRY 


This section describes features of the BC TRY System of interest to those 
engaged in the development and use of computing procedures in the 
behavioral sciences. The computing science features described here are a 
varied lot, some simple, others complex in their logic and execution. 
Although many of the details of the procedures are of interest, treatment 
here will deal only with general aspects. 


Central storage of programs 


The entire system of component programs (30 at the time this section was 
written) is stored on a tape (or disk) in a linkage or overlay mode (depend- 
ing on the computer) that allows any of the components to be called into 
execution at any time by any of the components. In this way the user of 
the System never handles the program decks. The linkage tape is on 
deposit with the machine operators, who mount it at request of the user; 
or the linked programs are permanently stored on a disk. The request for 
the BC TRY System is made via a short program (ACCESS) available to the 
user as an object deck. Hence no technical computing knowledge is 
required of users of the System. 


The most significant result of this system of program storage and
access is integration and linkage of the separate component programs in
the System. Because the System is unified as a collection of links, access
to the System implies access to each of the components. Executive-program
operation features and the data sharing facilities interact with
the mode of program storage to produce, in effect, a program of some
500,000 machine words with a practically unlimited amount of common
storage and flexibility in determining the pathway through the System at
execution time.

Two modes of transition from one component to another in the System
are provided. In the most general mode the GEP interrogates the
monitor input unit for instructions regarding the component to set into
execution. GEP then calls, via the linkage (CHAIN, OVERLAY) subroutine,
the designated program from the unit on which the system is resident.
Under normal conditions the STANALOG component in operation exits on
termination of its execution with a call for the GEP. This is executed via the
linkage subroutine. If an error occurs in calculation or on a control card, or
there is machine failure (in most cases except errors which cause machine
halts directly), each component will call the system component ERGIVE,
which punches the entire contents of the IST onto binary cards so that
the computations can be restarted with the program that terminated
execution in an error.


The general executive program (GEP)


The primary component of BC TRY, as a system, is the General Executive
Program (GEP). This program presently operates as a submonitor. Once
the monitor initiates the execution of GEP, the GEP performs several
preliminary operations and then monitors the sequence of components in
their execution as determined by executive control cards on the monitor
input unit.

The initiation features of the GEP result in the establishment and
checking of the assignment of input-output (I/O) units by unit designations
for symbolic functions. The details of this are described below (DYNATAPE).
In addition, the positioning and initiation of the storage units are overseen
by GEP. Perhaps most important is GEP's role in setting up tables and
common storage regions containing information about the location of
component programs on the BC TRY System and the files of computed
data on IST.

In its current form, the GEP recognizes several specific cards, each
with a distinctive form. On recognizing a card, the GEP performs the
pertinent operation, the first being the initiation functions discussed
above. Second, the GEP prepares for EXIT to monitor by rewinding tapes,
unloading the BC TRY System tape, and attending to other minor duties.
Third and fourth, the contents of the IST can be punched in binary on
cards or be made up from similar cards read by the executive. Fifth, the
entire contents of the IST can be printed. Sixth, the GEP can call programs
as links from a second chain tape in order to execute programs not on the
BC TRY System tape. Seventh, the GEP interprets the name of a component
program, e.g., COR2, the correlation program, on the executive control
card and calls the indicated link on the BC TRY chain tape. Should an
error occur, e.g., an incorrectly punched executive control card or a
nonexecutive control card occurring where an executive control card is
expected, the GEP enters an error mode, resulting in the punching of the
contents of IST onto binary cards.
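
The card-driven control just described can be pictured with a small
modern sketch. It is an illustration only, not BC TRY code: the function
name run_executive, the dictionary standing in for the IST, the "/EXIT"
card spelling, and the log format are all assumptions made for the
example. The only conventions taken from the text are the leading slash
on an executive control card and the dump of the IST when a card is not
recognized.

```python
# Hypothetical sketch of an executive loop in the spirit of GEP.
# The components and the in-memory "IST" are stand-ins, not BC TRY code.

def run_executive(cards, components, ist):
    """Process executive control cards in order.

    cards      -- list of card images (strings)
    components -- dict mapping a component name to a callable taking the IST
    ist        -- dict standing in for the Intermediate Storage Tape
    Returns a log of events; on a bad card the IST contents are dumped.
    """
    log = []
    for card in cards:
        # An executive control card begins with a slash, e.g. "/DAP2".
        if not card.startswith("/"):
            log.append(("ERROR", "non-executive card: " + card))
            log.append(("DUMP", dict(ist)))  # stand-in for punching the IST
            break
        name = card[1:].strip()
        if name == "EXIT":                   # assumed spelling of the exit card
            log.append(("EXIT", None))
            break
        if name not in components:
            log.append(("ERROR", "unknown component: " + name))
            log.append(("DUMP", dict(ist)))
            break
        components[name](ist)                # run the link, then return here
        log.append(("RAN", name))
    return log
```

Each component receives the shared store and returns control to the
loop, which mirrors the way a STANALOG exits with a call for the GEP.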

An important feature of the System is the ability to modify the System 
without handling the deck of binary cards for the entire System. This 
feature also saves a great deal of computer time. Only those components 
which have been modified need be dealt with. The EDIT option remakes 
the chain tape, replacing obsolete links with their modified versions. 


Data sharing 


In the early conceptions of BC TRY a fixed order of program execution was
used. Whatever data were calculated by a program and needed by
succeeding programs were recorded seriatim by files on a binary tape. When
the data were needed for the subsequent calculations, they were simply
read from that tape. The location of the data was fixed by the file number.
Consequently if program PPP needed the data calculated by program
QQQ, PPP had to contain information regarding the location, i.e., the
file number, say XXX, of the data. This information was provided by having
QQQ always record the data as file XXX. This method required some
programs to write dummy files or to restructure the binary storage tape,
which was a satisfactory procedure until the full force of the modularity
of the master design became apparent. To achieve the grand design and
have a truly superprogram with virtually unlimited flexibility in the serial
use of the programs we had to abandon the rigid seriatim nature of our
file keeping. As often seems the case, the problem and its solution were
proposed almost at the same time. We were trying to solve the file-keeping
problem at the time Fortran II was introduced. Without Fortran II and its
CHAIN and COMMON features the solution might have been delayed
indefinitely.


The CHAIN and COMMON features of Fortran II allow a program on
the CHAIN tape to be loaded into core without disrupting selected
segments of the COMMON portion of the core. This core area is called ISTCOM
in the BC TRY System and contains a table of the symbolic names of files
currently available on the IST. The System has since been translated to
Fortran IV and uses the corresponding features COMMON and OVERLAY.

The data-sharing system of BC TRY is a method for saving selected 
intermediate calculations as files on a binary tape or disk. These files 
include logical information (as in variable names, titles, etc.), vectors, 
matrices, and lists. A number of standard forms and parameters are 
defined. Corresponding subroutines which read or write these forms and 
parameters are called the IST subroutines and automatically transmit the 
respective quantities to and from the IST. 

When a file is transmitted to the IST, a unique character label is
given to it. This label is written on the tape with the rest of the information
of the file and is read when the file is read. The labels are defined by the
programmer at the time the component program is written and are
associated with information generated by the program.

There are six basic elements to the logic and mechanics of IST usage:

1. The information involved, which may be generated by more than one
   component program
2. The information label
3. Parameters associated with the form of the information
4. The specific IST subroutine used to transmit the information
5. The format of the information to be transmitted
6. The array of file labels and the associated array of file numbers

When information is of potential use in later stages of analysis (as
determined at the time the program was written), it is recorded as a file
on the IST at the end of the string of files already on the tape. In addition,
the symbolic name, or label, of that file and the physical file number of
the file on the IST are recorded in a pair of locations in the ISTCOM area.
When a program, at a later point in the job, requires a file with the label
of a file on IST, it interrogates ISTCOM. The physical file number of the
file containing the information desired is determined by matching the
label of the desired information with a label in ISTCOM and hence with a
physical file number.

Recovering a file of information from IST does not alter the contents
of the ISTCOM or IST. However, when transmitting a file of information to
IST, two possibilities must be considered. If there is no old file with the
same label as the new file, the new file is added to IST and the new label
and physical file number are added to ISTCOM. If a file with the same label
as the new file is already on IST (and hence its label is in ISTCOM), the
new file is added to IST and the physical file number of the new file
replaces the physical file number of the old file.

Updating the physical file numbers in ISTCOM implies that the System 
does not have old files available once a new file with the same label is 
generated. This is a distinct advantage in that files of the same name and 
hence confusable are never present simultaneously excepting as dead or 
nonaccessible information on the IST. However, one disadvantage is 
inescapable: old information in physical files having the same labels as new 
information in other physical files is not available to most components of 
the System. This is vitiated in large part by the data retrieval components 
of the System discussed below. 
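
The bookkeeping described in the last few paragraphs amounts to an
append-only store with a label table. The following is a minimal sketch
under that reading; the class name IntermediateStore and its methods are
hypothetical, and a Python list and dictionary stand in for the tape and
for the ISTCOM area of core.

```python
class IntermediateStore:
    """Append-only file store with a label table, loosely modelled on IST/ISTCOM."""

    def __init__(self):
        self.tape = []     # physical files in the order written (the "IST")
        self.istcom = {}   # label -> physical file number (the "ISTCOM" table)

    def write(self, label, data):
        """Append a file; an existing label is re-pointed at the new file."""
        self.tape.append((label, data))
        file_number = len(self.tape)       # physical file numbers start at 1
        self.istcom[label] = file_number   # add, or replace the old number
        return file_number

    def read(self, label):
        """Match the label in the table, then fetch that physical file."""
        file_number = self.istcom[label]   # KeyError if never written
        stored_label, data = self.tape[file_number - 1]
        assert stored_label == label       # the label travels with the file
        return data
```

Writing STDEV1 twice leaves both physical files on the tape, but only
the later one can be reached through the table, which is the "dead"
information referred to above.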

An important aspect of this system is the recording in the file itself
of the parameters determining the specific character of the information
in the file. The number of rows and columns in a matrix, the number of
records in the file, the number of elements in a list, the format of the file,
etc., are all indicated in the first few words of the file. The advantage of this
recording form is that the component needing a certain kind of information
generated by another program need not have information regarding the
specifics. For example, the rotation component GYRO need not have the
number of factors before interrogating IST to obtain the factor coefficients
to be rotated.
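
A self-describing file of this kind can be sketched as follows. The
layout used here (two 4-byte integers for rows and columns, followed by
8-byte floating-point values in little-endian order) is an assumption
chosen for the illustration; the text does not give the actual word
layout of an IST file, and the function names are hypothetical.

```python
import struct

def pack_matrix(rows):
    """Serialize a matrix with its shape stored in the first two words."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    flat = [x for row in rows for x in row]
    header = struct.pack("<ii", n_rows, n_cols)        # assumed header layout
    body = struct.pack("<%dd" % len(flat), *flat)
    return header + body

def unpack_matrix(blob):
    """Recover the matrix using only the information inside the file."""
    n_rows, n_cols = struct.unpack_from("<ii", blob, 0)
    flat = struct.unpack_from("<%dd" % (n_rows * n_cols), blob, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
```

The reader is never told the shape from outside, just as GYRO learns
the number of factors from the file of factor coefficients itself.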


Data transmission 


In addition to providing within-system sharing of data, BC TRY permits a
full range of user-system data sharing procedures. Most of these
procedures are involved in saving data from calculations in a convenient and
usable form. However, the System provides for direct intervention by the
user in the data sharing aspects of the system during execution. These
features involve an error-initiated procedure giving a binary deck
reproduction of the IST which can at a later date be used to begin the job
where it was terminated by the error. In addition, the user can produce the
same binary deck for the IST by calling for it through the GEP. When such a
deck is in turn encountered by the GEP, the IST and ISTCOM are
reconstructed to correspond to the ISTCOM and the IST files current at the
time GIVE was invoked. This option of GEP is especially useful where two
sets of calculations based on one set of intermediate data are planned (as
in two types of analyses of a single correlation matrix). After calculating
the basic data the IST is punched through a call to GIVE. The second
analysis avoids recalculation of the basic data by constructing the IST
with the binary cards obtained with the earlier call to GIVE.

The user can introduce individual files in IST by the use of a separate
component of the System, GIST. Hence if a file is used by a component and
it has not been generated in the sequence of calculations already executed,
it can be made available. Also if special information is required, e.g.,
predefined clusters in a cluster analysis or a previously calculated factor
matrix, it can be put on IST with the appropriate ISTCOM entries.


DYNATAPE 


One of the major problems in moving a computing system from one
computing installation to another is the lack of consistency in the use of
Input/Output units at various installations. The I/O configurations and
assignment of units to various functions have not been standardized in the
computing profession. An attempt to make the I/O of the BC TRY System as
installation-independent as possible is reflected in the system of
subroutines called the DYNATAPE system. These subroutines assign the I/O
unit functions to I/O units symbolically at execution time. The I/O units
which are used at a given installation for systems, input, output, and
scratch can be assigned appropriately at a receiving installation without a
great deal of technical knowledge and, most important, with the
compilation of only one very short subroutine after modification of one
block of statements, involving no technical programming knowledge.

Each type of unit function in the BC TRY System has been given a
symbolic name. Thus, each I/O statement has one of these symbolic
names associated with it. In BC TRY there are nine distinct functions for
I/O units. Each unit is assigned to the function by (1) the installation
conventions and (2) the needs of BC TRY. The symbolic names and the
associated unit functions are:


Symbolic Name:    Unit Function:

MONIN             Monitor input
MONOUT            Monitor output
MONPUN            Monitor punch
MTSYS             BC TRY System
MTIST             Intermediate storage
MTDST             Data storage
MTSC1             First scratch
MTSC2             Second scratch
MTTEST            Program test


The logical number of a unit is set by assigning the desired number
as the value of the symbolic name. In all components the values of the
symbolic names are set in the main program. References to the units
are carried through the calling sequence of each subroutine. The tables
of symbolic unit names and their associated values are contained in an
area of COMMON shared by all links in the System. The DYNATAPE system
establishes the tables, checks them for consistency, redundancy, and
undefined logical numbers, and restores the tables should a component
program destroy the information in COMMON.
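
The kind of table checking attributed to DYNATAPE can be sketched in a
few lines. The nine symbolic names are those listed above; everything
else is an assumption for the example, including the function name
check_units, the rule that no two functions may share a logical unit
number, and the wording of the messages.

```python
# Symbolic unit functions as listed in the text.
FUNCTIONS = ("MONIN", "MONOUT", "MONPUN", "MTSYS", "MTIST",
             "MTDST", "MTSC1", "MTSC2", "MTTEST")

def check_units(assignment):
    """Return a list of problems in a symbolic-name -> unit-number table."""
    problems = []
    # Undefined logical numbers: a function with no unit assigned.
    for name in FUNCTIONS:
        if name not in assignment:
            problems.append("undefined: " + name)
    # Redundancy (assumed rule): one unit number given to two functions.
    seen = {}
    for name, unit in assignment.items():
        if name not in FUNCTIONS:
            problems.append("unknown function: " + name)
        elif unit in seen:
            problems.append("unit %d shared by %s and %s"
                            % (unit, seen[unit], name))
        else:
            seen[unit] = name
    return problems
```

A receiving installation would then change only the table of numbers,
never the I/O statements that refer to the symbolic names.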


Macro matrix and vector manipulations 


In order to provide the ultimate in flexibility of calculation within the Sys- 
tem the program SMIS, the symbolic matrix interpretive system, written 
by E. Wilson (see SHARE F1 BC SMIS) has been adapted to BC TRY. This 
program provides a complete array of matrix and vector operations with 
access to the IST through the IST subroutines. 

SMIS provides a flexible means of performing matrix operations 
under the control of a sequence of punched cards. A program run consists 
of reading the input deck and executing the operation designated by 
symbols which are selected by the user. The input deck of SMIS is com- 
posed of control, data, and remark cards. The control cards designate the 
operation to be performed on a given matrix. The matrix involved may be 
input on data cards or called from IST by reference to the file label of the 
file containing the data. 

Some 35 separate commands are included in SMIS. For example,
the commands allow data sharing with IST, input and output from and to
the monitor, scalar multiplication, finding functions of elements of a
matrix, eigenvalue and eigenvector solutions, matrix addition, and matrix
multiplication.


Illustration: the program DAP 


The principles governing the functioning of the System may be illustrated 
with the first program needed in a BC TRY System use. This is the DAta 
Processor (DAP) mentioned above. Generally, the objective of this program 
is to get the raw score matrix stored in records on the Data Storage Tape 
(DST), to check it thoroughly, to compute the means and standard devia- 
tions of all the variables, and to put these values in files on the IST. 

How does the user "control" this program? The Use section of the
"User's Description" in the Manual gives all the details of how to do the
job. Briefly, one first commands the executive to call DAP by punching an
executive control card for DAP. This card has simply "/DAP2" on it. The
slash is a control symbol for the executive GEP, and the DAP2 is the signal
for the executive to give control over to the DAP program.

Specific directions are given to the DAP program itself by punching 
certain alphanumeric symbols on the specific component control cards 
of DAP. If, as with most problems, a V-analysis is planned for not more 
than 120 variables and not more than a few hundred subjects, the control 
cards of DAP are quite simple. On the first card is punched the title of the 
problem, which will be printed on all output. The second card is the 
parameter card, on which is to be punched in specified columns only four 
items of control information positively required by DAP: the number of 
variables, the number of subjects, the number of columns on data cards 
devoted to the raw scores on each variable, and a number indicating 
whether there are any missing data. Following these two control cards 
comes the data deck, consisting of cards containing the scores of all the 
subjects on all the variables. 

It is possible to choose to input more information than this minimal
amount, and sometimes it is necessary. For example, if the scores on the
cards are not in columns of constant width, one must input a format card
telling DAP of the fact. If some variables on the data deck are to be skipped,
an input format card is punched to tell DAP to do so. Usually one wants to
assign a symbolic name to each variable, a name to be printed on all
output; if so, a VNAMS card must be input. There are other options such as
inputting names of subjects or reordering the variables and reflecting
some of them, but they are all extras; only the first two control cards are
necessary in the usual case.

So controlled, DAP executes its algorithms on the data and outputs
the results in various forms. It automatically outputs the raw scores onto
the DST together with certain necessary identifying information and
parameters; these scores are in records on DST so that they can be read
later by such programs as COR2, COR3, FACS, and RSCAT. It also outputs
some results of its work in files on the IST, the most important of which
are the title of the problem, the means and standard deviations of
variables, and the names of the variables; these files are extensively used
by other programs. The printed output gives a printing of the options that
have been taken, and it prints out in detail the values and entries it has
stored in the different IST files.
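
The numerical part of this work is small, as a sketch makes plain. The
function below computes the two sets of constants that DAP is said to
file as MEANS1 and STDEV1. The function name is hypothetical, and the
use of N rather than N - 1 as the divisor is an assumption, since the
text does not say which form DAP used.

```python
import math

def means_and_sigmas(scores):
    """Column means and standard deviations of an N x n raw score matrix."""
    n_subjects = len(scores)
    n_variables = len(scores[0])
    means, sigmas = [], []
    for j in range(n_variables):
        column = [row[j] for row in scores]
        mean = sum(column) / n_subjects
        # Divisor N is assumed here; the text does not state DAP's choice.
        variance = sum((x - mean) ** 2 for x in column) / n_subjects
        means.append(mean)
        sigmas.append(math.sqrt(variance))
    return means, sigmas
```

In the terms used above, the two returned lists would be written to the
IST under the labels MEANS1 and STDEV1 for later components to read.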

In general, then, DAP illustrates the potentialities of the computer
system. For the purposes of cluster and factor analysis the computer
does little computing work, but its role in preparing the data for other
programs is crucial.


Chapter 4 


GENERAL ATTRIBUTES: CONCEPTUALLY DEFINED 
GROUPINGS OF VARIABLES VS. EMPIRICAL CLUSTERS 


A collection of many variables can be reduced in number by grouping
the variables into categories. The individuals observed on the variables
can then be compared by their composite scores on the categories
rather than by the whole set of variables. What should be the nature of
the groupings of the variables? The answer, beyond doubt, is that the
variables composing each group, called the "definers" of the category,
must be similar in some way. But in what way?


Conceptually defined groups of variables 


Traditionally, the standard of similarity has been that each category of
variables should be a "rational grouping." This expression means that the
definers must all share some common abstract quality or similarity of
content. On this standard only those variables which were alike on the
basis of some theoretical or social construction would be grouped together.

Table 4.1 shows the five categories in which the 24 test variables of
the famous Holzinger-Swineford study (1939), generally known as the
Holzinger-Harman problem (Harman, 1967), were cast by its investigators.
The variables were selected so that there are at least four test samples in
each of the five groupings. For tests in the first category, the spatial tests,
most reasonable persons would agree that the work that the subjects,
school children, go through in taking them shares the common property
of imaginatively manipulating visual figures and shapes in a spatial way.
The tests actually present no real things in space, only representations
of them on paper. The five verbal tests, V5 to V9, all require dealing with

TABLE 4.1 DEFINING VARIABLES OF THE RATIONAL
GROUPS IN THE HOLZINGER-SWINEFORD STUDY


Spatial tests

F1    Vis   Visual figure completions
F2    Cub   Cube similarities
F3    Fbd   Paper form board
F4    Loz   Lozenge shape rotations


Verbal tests

V5    Inf   General information
V6    Cmp   Paragraph comprehension
V7    Snt   Sentence completion
V8    Wcl   Word classification
V9    Wmn   Word meaning (vocabulary)


Speed tests

S10   Add   Addition
S11   Cod   Code substitution
S12   Cnt   Counting groups of dots
S13   Scc   Straight or curved capitals discrimination


Memory tests

M14   Wrg   Word recognition
M15   Nrg   Number recognition
M16   Frg   Figure recognition
M17   Wn    Object-Number recall
M18   Nf    Number-Figure recall
M19   Fw    Figure-Word recall


Mathematical-ability tests

N20   Ded   Deduction
N21   Puz   Numerical puzzles
N22   Rsn   Problem reasoning
N23   Ser   Series completion
N24   Ari   Woody-McCall mixed fundamentals, form I

previously learned ideas or things expressed verbally. The four speed tests
all require "mental speed" in manipulating easy contents, the elements
of which were quite familiar to all the children. The six memory tests fall
into two categories: the first three are tests of memory that require only
recognition of materials just learned in the test situation; the second three
are also tests of memory of just learned materials, but they test memory
by recall. Finally, the last five tests all seem to sample some form or other
of mathematical ability.

Forming such composites in most problems has many practical short- 
comings and ambiguities. First, any classification that one might claim to 
be meaningful could be rejected easily as absurd or unreasonable by some 
argumentative authority in the field. Also, one might have some reserva- 
tions about a particular grouping, because (as in the Holzinger-Swineford 
study) the different definers of the group may have so much diversity in 
other respects than the common property abstracted out in the group that 
the extraneous noncommon elements might completely overshadow the 
common. A third and more disastrous difficulty is the discovery that one is 
dealing with a collection of variables for which it is not possible to devise a 
set of rational categories or that categories can be found to accommodate 
part of the collection but not the rest of it. 


Empirical clusters 


Variables grouped together to form a composite must be shown objectively 
to be similar; they must also be shown to be different from the variables 
of other composites. These properties of “within-group similarity, between-group 
difference” distinguish empirical clusters and factors from traditional 
rational categories. 

The completely objective feature that describes the similarity of the 
definers of a cluster and their difference from other clusters is the property 
of collinearity. This term means simply that the definers “fall on the 
same line,” i.e., are collinear. Generally, collinearity is defined by the line 
graph of the correlation coefficients of two variables with all the variables 
in the study, their correlation profiles. Collinear variables have the same 
profile of correlations. The phenomenon of clusters of collinear variables is 
illustrated here in three very different studies. Clusters of collinear vari- 
ables have two objective characteristics of similarity: they correlate posi- 
tively with each other, and they follow the same pattern of correlations 
with other variables. They also are objectively different from other clusters 
of collinear variables because their common correlation profiles have a 
different shape from that of other clusters. 


The degree of collinearity of the correlation profiles of any two variables 
can be measured objectively by a special index of collinearity (see 
Chap. 12), which measures the degree to which the correlations of two 
variables are consistently proportional across all the other variables of the 
study. The index $P^2$ is 1.00 when all their correlations with the other variables 
are the same; it is .00 when their correlations vary from each other 
in an unsystematic way; and the square root $P$ is -1.00 if their correlation 
profiles are mirror images. Defining variables of empirical clusters reveal 
within-group similarity from the fact that the $P^2$ values between them 
approach 1.00, but they show between-group differences in collinearity 
because their $P^2$ values with the defining variables of other clusters are 
considerably less than unity. 
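
The limiting cases just described can be checked numerically. The sketch below is not the formula of Chap. 12, which this section does not reproduce. It assumes, as one plausible reading of "consistently proportional," that $P$ is the cosine of the angle between the two correlation profiles, taken over all variables other than the pair being compared. The function names `collinearity_index` and `reflect` and the small correlation matrix are hypothetical.

```python
import math


def collinearity_index(R, i, j):
    """Assumed form of P: cosine between the correlation profiles of
    variables i and j, taken over every other variable k (k != i, j)."""
    others = [k for k in range(len(R)) if k != i and k != j]
    a = [R[i][k] for k in others]
    b = [R[j][k] for k in others]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm


def reflect(R, j):
    """Reverse the scoring of variable j: its correlations with all
    other variables change sign; the diagonal is left alone."""
    n = len(R)
    return [[-R[r][c] if (r == j) != (c == j) else R[r][c]
             for c in range(n)] for r in range(n)]


# Hypothetical correlations among five variables (illustration only).
# The profiles of variables 0 and 1 on variables 2-4 are proportional.
R = [
    [1.00, 0.50, 0.40, 0.20, 0.10],
    [0.50, 1.00, 0.60, 0.30, 0.15],
    [0.40, 0.60, 1.00, 0.25, 0.05],
    [0.20, 0.30, 0.25, 1.00, 0.35],
    [0.10, 0.15, 0.05, 0.35, 1.00],
]

p_same = collinearity_index(R, 0, 1)                # proportional profiles
p_mirror = collinearity_index(reflect(R, 1), 0, 1)  # mirror-image profiles
p_loose = collinearity_index(R, 0, 4)               # weakly collinear pair

for label, p in [("same", p_same), ("mirror", p_mirror), ("loose", p_loose)]:
    print(label, round(p, 2), round(p * p, 2))
```

Under this assumed definition the proportional pair gives $P^2$ of 1.00, the reflected pair gives $P$ of -1.00, and the loosely related pair falls well below unity.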

Cluster analysis begins, essentially, with a comparison of variables 
using the index $P^2$ and defines clusters of variables by finding subsets of 
variables having high values of $P^2$ within the subsets. On the other hand, factor 
analysis methods result in the definition of dimensions or factors without 
explicit reference to clusters or the index $P^2$. However, the mathematical-geometric 
model of the general variability in a set of variables, as calculated 
by a factor analysis, displays the collinearity of similar variables in a way 
equivalent to the procedures using $P^2$. After the variables are factored, 
they can be plotted in a spherical diagram (SPAN in BC TRY) or space. 
Variables with collinear correlation profiles will be collinear in this space; 
they are represented as points in the space lying on the same vector line 
drawn from the origin in the spherical diagram. We will turn to these rela- 
tionships again when we take up factor analysis in detail. 

What is exciting about locating collinear clusters objectively is the 
sense of discovery in doing so. The conclusion that collinear variables must 
be sampling the same domain of determinants of individual differences 
is a natural result of inspecting correlation profiles. In many problems, 
collinear clusters often turn out unsuspectingly also to be meaningful 
composites, so that one may easily speculate on what the substrate of 
causal components may be. In other cases, the causes may be obscure. 
For example, if the old saw about British villages were true that the variable 
“number of old maids” is highly inversely correlated with “number of 
field mice in the meadows that surround the villages,” then these two variables 
might (after “optimal reflection”) enter a collinear cluster, but it could 
take a Darwin to figure out the rational ecological connection. The imaginative 
cluster analyst might have solved the problem by including “number of 
cats,” which, no doubt, would be collinear with number of old maids and 
number of mice. 


Collinear clusters in the Holzinger-Swineford study 


In Fig. 4.1 the high degree of collinearity of the correlation profiles of the 
defining variables of the first four rational groups of Holzinger-Swineford 
tests is easily seen (for simplicity, mathematical ability is left out here). 
The top chart presents the profiles of the four spatial tests. For example, 
the Loz line is simply a plotting of the line graph of the successive correla- 
tions of the Loz test with the first 19 variables of the study (see Table 4.1 
for abbreviations of tests and their meanings). The correlations are read 


FIGURE 4.1 

Collinearity of rational composites and empirical clusters in the Holzinger problem. 
[Correlation profiles in four charts, top to bottom: spatial, verbal, speed, memory. Abscissa: the 19 tests F1 to M19, grouped by rational domain (Spatial, Verbal, Speed, Memory). Ordinate: correlation coefficient r.] 

directly from the correlation matrix. At the top of the figure the tests are 
arrayed as successive points on a common abscissa, identified by their 
names, and grouped by the four rational categories. In general, all four 
space tests follow the same line; hence they are collinear. Even though 
one of them, Cub, departs from the others a bit, it preserves the same 
pattern. The second chart shows the profiles of the five verbal tests, all 
showing high collinearity (with a trivial departure of the Wcl test) but a 
different line from that of the preceding four space tests. In the next chart 
are the four speed tests with their own distinctive collinearity (with some 
deviation of Scc). The final group of six memory tests is not sharply collinear. 
None, however, would be classed with the first three groups. They are, in 
fact, collinear in respect to these features: all generally reveal positive 
correlations with all tests, and each appears to correlate higher with tests 
of its own memory category than with tests in other rational groups. 

Generally, then, when Holzinger and Swineford selected tests for 
their four rational groups they did in fact also select collinear clusters. 
Each group appears to sample a special domain of causes of individual 
differences, and the domains seem to be different from each other. The 
particular selection of tests was based on findings from an earlier project 
called the Spearman-Holzinger Unitary Trait Studies and from other 
earlier researches in which space, verbal, speed, and memory tests identi- 
cal with, or similar to, those in this Holzinger-Swineford study were exam- 
ined. The test selections were therefore expressly designed on the basis 
of prior knowledge to prove that the kind of collinearities seen in Fig. 4.1 
would emerge in a new sampling of children. 

The important scientific question is this: Without knowing the rational 
classes, can we objectively and empirically sort the 19 tests, without prior 
knowledge, into the four rational groups? The answer is that we can do so 
by objective cluster analysis methods using the objective index of collinearity 
$P^2$, discussed above. At this point we need to demonstrate that the 
four collinear clusters do in fact reveal high within-group $P^2$ values and 
somewhat lower between-group values. How this index is used with other 
procedures to discover the four empirical clusters that turn out also to be 
meaningful rational classes will be developed in the chapter on key-cluster 
factoring. 

The average $P^2$ values between the sets of definers of the four 
rational groups are shown in Table 4.2. For the four spatial definers the 
congruence of their correlation profiles, so clear in Fig. 4.1, is seen in the 
table to have a high average $P^2$ of .94. The next set of five verbal tests has 
an even higher value of .97, certainly confirmed in their graph. The speed 
and memory groups have $P^2$ values just under .90. These within-group 

TABLE 4.2 AVERAGE $P^2$ VALUES BETWEEN THE 
SETS OF DEFINERS OF THE FOUR RATIONAL 
CLUSTERS 

          Spatial  Verbal  Speed  Memory 
Spatial     .94     .77     .73    .77 
Verbal      .77     .97     .66    .71 
Speed       .73     .66     .87    .76 
Memory      .77     .71     .76    .89 


values are substantially larger than the between-group values listed in 
cells off the diagonal. It should therefore be clear that if we set as a criterion 
that the definers of a given cluster must mutually show $P^2$ values above 
.80, then an empirical cluster selected by this criterion will include collinear 
definers. 
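
As a rough illustration of how such a criterion separates the groups, the sketch below applies the .80 threshold to the average values of Table 4.2. The criterion in the text concerns the mutual $P^2$ values of individual definers; using group averages here is a simplifying assumption, and the variable names are hypothetical.

```python
# Average P^2 values from Table 4.2 (rows and columns in the same order).
groups = ["Spatial", "Verbal", "Speed", "Memory"]
avg_p2 = [
    [0.94, 0.77, 0.73, 0.77],
    [0.77, 0.97, 0.66, 0.71],
    [0.73, 0.66, 0.87, 0.76],
    [0.77, 0.71, 0.76, 0.89],
]
CRITERION = 0.80  # threshold named in the text

# Diagonal cells: average collinearity among definers of the same group.
within = {g: avg_p2[i][i] for i, g in enumerate(groups)}

# Off-diagonal cells: average collinearity between definers of two groups.
between = {
    (groups[i], groups[j]): avg_p2[i][j]
    for i in range(len(groups))
    for j in range(i + 1, len(groups))
}

all_within_pass = all(v > CRITERION for v in within.values())
all_between_fail = all(v < CRITERION for v in between.values())

print(min(within.values()), max(between.values()))  # 0.87 0.77
print(all_within_pass, all_between_fail)            # True True
```

The lowest within-group average (.87, speed) still lies above the highest between-group average (.77), so a cutoff of .80 falls in the gap between them.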


Collinear clusters in the social-area study 


In order to illustrate the generality of this crucial concept of collinearity, 
the empirical clusters discovered in two quite unrelated studies of group 
differences are presented here. The first is a social-area study of the 
metropolitan San Francisco Bay Area (Tryon, 1955), in which the objects 
are more than 300 large neighborhoods (census tracts) observed in the 
base year of 1940. The Census Bureau groups census-tract characteristics 
by the three rational categories that were not intended to be the basis of 
composites, namely, population, occupation, and home variables. The 
neighborhoods were scored on 33 of these variables: 8 population, 13 
occupation, and 12 home characteristics. 

An empirical cluster analysis of the correlations among the 33 vari- 
ables yielded the three salient collinear clusters whose definers are listed 
in Table 4.3 and whose correlation profiles are graphed in Fig. 4.2. When 
one looks at the top chart of the first group of five variables presented in 
Fig. 4.2 and reads the contents of these five definers from the table, it 
should be apparent that the general attribute measured by a composite 
score on them clearly refers to what is commonly recognized as the 
socioeconomic level of the neighborhoods. 

The middle chart of the figure shows the correlation profiles of the 
second cluster, a strikingly collinear subset which, from Table 4.3, clearly 


TABLE 4.3 DEFINING VARIABLES OF 
THE EMPIRICAL CLUSTERS OF THE 
SOCIAL-AREA STUDY (GROUP 
DIFFERENCES) 

Socioeconomic level 

Mm  Managerial-professional, males 
Uc  Undercrowded 
Df  Female domestics (living in) 
Om  Own account males 
Co  College education 

Family life 

Oo  Owner-occupied 
Fl  Large families 
Fd  Family-detached homes 
Uf  Nonworking females (housewives) 
Am  Older-aged males 

Assimilation 

Sm  Skilled males 
Nw  Native-born whites 
F   Females 
Fe  Foreign from northwest Europe 
Wf  White-collar females 

refers to the familiar suburban-urban dimension among metropolitan 
neighborhoods. A high score on a composite of these definers identifies a 
suburban neighborhood, one characterized by a high level of family life. 

The bottom chart is composed of two subclusters (one in broken 
lines) sufficiently collinear to collapse into one cluster. The rational 
conceptualization of this cluster is apparent if it is noted that a neighborhood 
with a low score on a composite of these five definers is characterized by 
unskilled workers, nonwhite and foreign, foreign from non-Protestant 
Europe and Asia, and blue-collar working women: clearly the unassimilated, 
segregated minorities. A high score on this cluster is thus identified as 
assimilation. 

The collinear clusters here sample three domains of basic components 
that differentiate the neighborhoods. These three general domains 
were discovered by applying purely objective criteria that do not depend 
on theoretical constructs. Nevertheless, the results do reveal that collinear 
subsets of variables are meaningful dimensions and, incidentally, are those 
which social scientists have theoretically speculated about a priori for some 
years (Shevky and Williams, 1948). 


Collinear clusters in the voting-attitude study 


The second illustration comes from an investigation of the attitudes of 
small neighborhoods (precincts) as revealed by their voting on election 
issues in the city of San Francisco, 1954. In this study, a random sample of 
200 such neighborhoods, drawn from more than 1,000 that composed 
the full supply, was the object of the analysis. The variables on which they 
were observed were 31 city and state propositions, including the vote for 
governor. After reading all the propositions and studying the preelection 
booklet giving arguments pro and con, it is still difficult to come up with 
any convincing set of rational categories that would be the basis of structuring 
a reduced set of composites representing general attitudinal 
attributes of these neighborhoods. 

An objective cluster analysis of the intercorrelations between them 
readily revealed, however, three salient clusters. The definers of them 
are listed in Table 4.4, and their correlation profiles are shown in Fig. 4.3. 
In the top chart of Fig. 4.3 the four definers of the cluster described there 
are best understood from conceptualizing the neighborhood with low 
scores on them, i.e., with a low percent voting for the measures. Such a 
generally low-scoring neighborhood is against (freely interpreting the 
names of the variables in the figure) fringe benefits for city hospital 
employees, against aid to needy aged, for the Republican for governor, 

FIGURE 4.3 


Collinearity of empirical clusters of election measures in the study of voting 
attitudes of precincts. 


and against the state's going into the business of building parking facilities. 
The general attitude domain sampled by these four definers is quite 
clearly the dimension of liberal vs. conservative, Democratic vs. Republican, 
statism vs. rugged individualism. This cluster deserves the label “political.” 

The middle chart of profiles includes five bond issues for certain 
community enterprises. Since these enterprises were to be funded from 
property taxes, those neighborhoods with a higher vote in favor of them 
are neighborhoods of people readier to increase taxation for them. Those 
opposed are less enthusiastic and are composed more heavily of property 
owners. This is the property taxation cluster. 

The bottom chart includes a cluster with three definers, high scores 
on which are earned by neighborhoods that favor exempting various 
institutions from certain taxes: churches, welfare institutions, and colleges 

TABLE 4.4 DEFINING VARIABLES (ELECTION 
MEASURES) OF THE EMPIRICAL CLUSTERS OF THE 
VOTING-ATTITUDE STUDY OF PRECINCTS (GROUP 
DIFFERENCES) 

Political, P 

HsEm  Hospital employee benefits 
NdAg  Needy aged aid 
DmGv  Democrat for governor 
Pk    Parking facilities by state 

Taxation, T 

HsBn  Hospital bonds 
ExBn  Exhibition hall bond 
AgBn  Aged home bonds 
ScBn  School bond 
VtBn  Veterans’ bonds 

Ethnic, E 

ChEx  Church tax exempt 
WIEx  Welfare institutions (religious, etc.) tax exempt 
CoEx  College tax exempt 


that are not already exempted, e.g., parochial schools. The strong inference 
here is that the attitude domain sampled by these three measures reflects 
a dimension that at one end favors ethnic minorities and at the other end 
is unsympathetic to them. This appears to be an ethnic cluster. 


Rational composites and empirical 
clusters as domain samplings 


In the Holzinger-Swineford study the rational categories of intellectual 
characteristics (spatial, verbal, speed, and memory) demarcate four 
rather large domains from which the selected tests are in each case only 
one finite sampling. For the spatial category, for example, the investigators 
could have utilized other spatial tests that would have been just as acceptable 
as definers of this domain as the four actually used. Also, in the verbal 
domain, in place of the five actually used, the investigators could have 
selected another set from the plethora of tests that psychologists have 
invented and called tests of verbal ability. Similarly for speed and 
memory abilities, there is a vast domain of possible tests in each rational 
group. 

In each category not only are the tests seen to be samples from a 
rational domain, but from Fig. 4.1 virtually all of them are also seen to be, 
in addition, samplings from a collinear domain, more simply known as a 
cluster domain or, in orthodox factor analysis, a factor. With time, zeal, 
and a large budget one would be able to augment each of those categories 
with tests which, in addition to rational similarity, would have the same 
correlation pattern as those actually used in the study. 


Validity of a cluster score as a measure of a domain 
The point is emphasized that test samples are drawn from large domains, 
rational, or collinear, or both, in order to accent the fact that any composite 
score of a subject on a finite sample is not necessarily the exact score he 
would earn if it were based on a more extensive set of equally acceptable 
test samples drawn from a domain. A subject's observed, fallible cluster 
score inevitably suffers from limitations of domain sampling. Scores cannot 
be taken at face value: one must know how much rational or cluster-composite 
scores suffer from this type of sampling error. The degree of 
the limitation is quantified by the value of the correlation coefficient of the 
observed scores with domain scores that would be earned by the subjects 
on an indefinitely large battery of tests, all equally representative of the 
domain. Such a correlation coefficient is called the “domain validity coefficient” 
of the observed score. The expression “validity” carries its usual 
psychometric meaning, namely, the degree to which individual differences 
in fallible scores reflect individual differences in “true” scores, in this 
case hypothetical scores made by the subjects on an indefinitely large 
battery of tests drawn from the given domain. In orthodox factor analysis, 
the expression “accuracy of factor estimates” is also used in place of 
“domain validity of a cluster or factor score” (Harman, 1967, pp. 341ff), 
though the two expressions mean precisely the same thing. 

How can such a validity coefficient be calculated, seeing that it is 
impossible to expose a subject to an indefinitely large battery of tests? 
Actually it cannot, but it can be estimated from available knowledge of 
the intercorrelations between observed definers of the cluster. The estimation 
formula is developed in Chap. 12. No assumptions are involved in 
the formula, only the definition of the score on a domain, or factor, as 
being composited from scores on many variables collinear with the existing 
lot. The relative contribution of each definer of a cluster to the validity 
coefficient of the composite score is indicated by the size of the definer's 
communality, an important index that is the main topic of the next chapter. 
To secure good estimates of the communalities of the variables in order 
to find their contributions to validity is one of the reasons why cluster and 
factor analysis is preoccupied by the communality problem. 
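
To give a feel for how a validity estimate can follow from intercorrelations alone, the sketch below uses a simpler stand-in for the Chap. 12 formula, which is not reproduced here and which weights definers by their communalities. It assumes an equally weighted composite of k standardized definers with mean intercorrelation `mean_r`, applies the classical Spearman-Brown expression for internal consistency, and takes the domain validity as its square root, the relation reported for Table 4.5. The function names and the mean intercorrelation of .35 are hypothetical.

```python
import math


def internal_consistency(k, mean_r):
    """Spearman-Brown reliability of an equally weighted composite of
    k standardized definers with mean intercorrelation mean_r."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)


def domain_validity(k, mean_r):
    """Assumed estimate of the correlation between the observed composite
    and the domain score: square root of the internal consistency."""
    return math.sqrt(internal_consistency(k, mean_r))


def definers_needed(mean_r, target=0.90, k_max=1000):
    """Smallest number of equally collinear definers whose composite
    exceeds the target internal consistency (None if not reached)."""
    for k in range(2, k_max + 1):
        if internal_consistency(k, mean_r) > target:
            return k
    return None


MEAN_R = 0.35  # hypothetical mean intercorrelation, for illustration only

rel_4 = internal_consistency(4, MEAN_R)
val_4 = domain_validity(4, MEAN_R)
k_90 = definers_needed(MEAN_R)

print(round(rel_4, 2), round(val_4, 2), k_90)
```

Under these assumptions a four-definer composite has modest internal consistency, and many more equally collinear definers would be needed to push it past .90.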
The domain validities of observed composites, whether rational or 
cluster, are computed routinely in the Cluster Structure Analysis (CSA) 
Program of the BC TRY System. In the Holzinger-Swineford study, the 
validity coefficients of the four rational categories, consisting of their full 
sets of definers, are listed in the first row of Table 4.5. The second row 
gives the validities of the most collinear set, i.e., after deleting the least 
collinear single variable from each category (these are the variables 
graphed in broken lines in Fig. 4.1). 


TABLE 4.5 DOMAIN VALIDITIES AND RELIABILITY COEFFICIENTS OF COMPOSITE 
SCORES IN THE HOLZINGER-SWINEFORD STUDY 

                      Spatial  Verbal  Speed  Memory 

Domain validities: 
  Full sample           .83     .96     .91    .88 
  Most collinear set    .82     .96     .88    .88 

Reliability coefficients (internal consistencies): 
  Full sample           .69     .92     .83    .77 
  Most collinear set    .67     .92     .77    .77 


Note that in Table 4.5 the domain validity of cluster scores based on 
the five verbal tests is the very high value of .96, signifying that individual 
differences among the subjects in their cluster scores would match individual 
differences among them almost exactly if they were to be exposed 
to the impossible hardship of taking an indefinitely large number of verbal 
tests collinear with the present lot. In short, it would not be necessary to 
expose subjects to such a hardship: the results on the present verbal 
battery of five tests are approximately equal to those which would be 
earned on a full domain. 

In contrast, the composite scores of the subjects on the four spatial 
tests are a bit short on validity, their coefficient being only .83. It would be 
necessary to administer to the subjects additional spatial tests collinear 
with the present set of four if a really valid representation of individual 
differences was wanted. 

One advantage of knowing the degree of collinearity of the different 
definers of a domain is that one can often weed out the least collinear 
without loss of validity of the total score. To illustrate, from the second 
line of coefficients in Table 4.5, it can be seen that by eliminating the least 
collinear test in each category the validities of the total score on the smaller 
composites are reduced only trivially. 


Reliability coefficients of rational and cluster scores 


A more familiar index of how much a composite score is subject to error 
because of limitations of test sampling is its reliability coefficient. The 
reliability coefficient of a cluster score, or of any composite, is defined as its 
correlation with a second composite consisting of definers “strictly comparable” 
to the existing first set. When the strictly comparable set is 
defined by variables collinear with the observed definers, the correlation 
is called an “internal-consistency” reliability coefficient. The internal 
consistencies of the Holzinger-Swineford clusters are given in Table 4.5. 
These values are simply the squares of the validity coefficients given above 
them (see Chap. 12). In psychological measurement one likes composites 
to have reliability coefficients well up in the .90s. By this standard, it will 
be seen that only the verbal cluster score has satisfactory internal con- 
sistency. However, the internal-consistency coefficient is a lower bound 
of the reliability coefficient of a composite. If a strictly comparable set of 
definers of a composite is not only collinear with the existing definers but 
also a set of “parallel forms” or repeated measures of the existing definers, 
then the estimated correlation between the observed composite and the 
parallel composite will necessarily be higher than the internal-consistency 
reliability. This higher coefficient is termed the “parallel-form reliability 
coefficient” or the “stratified reliability” of a composite (Tryon, 1957a, 
eq. 32). Here are some examples. Since Holzinger and Swineford report 
the parallel-form reliability coefficients of the individual definers, we can 
compute the parallel-form reliabilities of the first three cluster composites. 
The respective values are .86, .93, and .93, which are quite respectable 
values compared to their internal consistencies of .67, .92, and .77 given 
in the bottom row of Table 4.5. 
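
The statement that the internal consistencies are the squares of the validity coefficients can be verified directly from Table 4.5. The short check below squares each tabled validity and compares it with the tabled reliability to two decimals; the dictionary layout and variable names are illustrative only.

```python
# Values from Table 4.5, in the order Spatial, Verbal, Speed, Memory.
groups = ["Spatial", "Verbal", "Speed", "Memory"]

validities = {
    "full sample":        [0.83, 0.96, 0.91, 0.88],
    "most collinear set": [0.82, 0.96, 0.88, 0.88],
}
reliabilities = {
    "full sample":        [0.69, 0.92, 0.83, 0.77],
    "most collinear set": [0.67, 0.92, 0.77, 0.77],
}

# Largest gap between a squared validity and its tabled reliability.
# A gap under .005 means the two agree once rounded to two decimals.
max_gap = 0.0
for row, vals in validities.items():
    for g, v, rel in zip(groups, vals, reliabilities[row]):
        gap = abs(v * v - rel)
        max_gap = max(max_gap, gap)
        print(f"{row:18s} {g:8s} validity^2 = {v * v:.4f}  tabled = {rel:.2f}")

squares_match = max_gap < 0.005
print(squares_match)  # True
```

Every squared validity agrees with the tabled internal consistency to within rounding, for both the full sample and the most collinear set.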

Notice that we have now defined a third type of domain, the stratified 
domain, composed in this case of an indefinitely large number of parallel 
forms of the defining variables of the observed composite. The stratified 
domain is more restricted in scope than a collinear cluster domain, which 
in its turn is more restricted than a rational domain. It is important to keep 
in mind the domain from which an observed cluster score is a sampling. 
For example, since the internal-consistency reliability of the spatial cluster 
scores is only .69, the individual differences in these scores rather poorly 
represent differences in scores on a domain consisting of a large number 
of collinear spatial tests. On the other hand, the parallel reliability of .86 
signifies that the observed cluster scores are quite reliable measures of a 
more restricted domain composed of a large number of tests that are 
parallel forms of the four definers, Vis, Cub, Fbd, and Loz. The only way 
to get a cluster score that is a good measure of a space cluster or factor not 
confined to measures of these restricted four space types is to add more 
collinear diversified tests to the observed cluster sample, enough to push 
the internal consistency of a new composite well up beyond .90. 


Correlations of a cluster domain with the 
individual variables (factor coefficients) 


It is helpful to estimate the degree of correlation of each variable of a study 
with the clusters (or factors) in the study. This estimate is known as a 
“factor coefficient” or “factor loading” (since such estimates were first used in 
connection with the procedures of factoring, to be treated in a later 
chapter). For any given cluster domain the array of its factor coefficients 
with the full set of variables is a simple linear function of the mean correlation 
of its definers with the variables (see Chap. 12). We do not really have 
to look at the exact values of these factor coefficients because the mean 
correlation profiles graphically display monotonically the factor coefficients 
of, say, all the 19 variables on each of the four cluster domains of the 
Holzinger-Swineford study in Fig. 4.1. In short, looking once again at the 
mean rise and fall of the profile of correlations with the various tests will 
help build up a conception of each cluster domain. Thus, for the spatial 
domain, its factor correlations on the five verbal tests are only a shade 
lower than on its own four spatial definers. This fact leads to the possible 
conclusion that verbal components, and not “pure” spatial dispositions, 
may be utilized by the children in solving the problems in the four space 
tests. Said another way, the position can be taken that the spatial domain 
or factor appears in some way to involve verbal elements. Following this 
line of analysis for each of the domains will give considerable insight into 
the rational construction of the four general attributes measured in the 
Holzinger-Swineford study. Since the factor coefficients are more systematically 
considered in a later chapter on cluster structure analysis, we 
do not treat them in detail here. 
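
The mean correlation profile on which this reading rests is easy to compute. The sketch below averages, for each variable of a study, its correlations with the definers of one cluster. It stops at the mean profile and does not apply the linear function of Chap. 12 that converts the profile into factor coefficients. Leaving a definer's correlation with itself out of its own average is an assumption made here, and the function name and the four-variable matrix are hypothetical.

```python
def mean_correlation_profile(R, definers):
    """Mean correlation of each variable with the cluster's definers.
    A definer's correlation with itself is left out of its own average
    (an assumption of this sketch)."""
    profile = []
    for v in range(len(R)):
        partners = [d for d in definers if d != v]
        profile.append(sum(R[d][v] for d in partners) / len(partners))
    return profile


# Hypothetical correlations: variables 0 and 1 define one cluster,
# variables 2 and 3 define another (illustration only).
R = [
    [1.0, 0.6, 0.3, 0.2],
    [0.6, 1.0, 0.4, 0.3],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.3, 0.7, 1.0],
]

profile_a = mean_correlation_profile(R, definers=[0, 1])
profile_b = mean_correlation_profile(R, definers=[2, 3])

print([round(x, 2) for x in profile_a])
print([round(x, 2) for x in profile_b])
```

Each profile is highest on its own definers and lower, though still positive, on the other cluster's variables, which is the pattern the text reads off Fig. 4.1.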


Correlations between the cluster domains 


More globally, the rational character of each cluster domain can be better 
understood if its correlations with other clusters can be estimated free 
from limitations of sampling. Such correlations are termed in orthodox 
factor analysis “correlations between factors.” Their relative magnitude 
can be inferred from the general level of the correlation profiles between 
the definers of the different clusters. For example, in Fig. 4.1 the sets of 
definers of all four clusters generally have nonzero correlations at about 
the same positive level, the actual correlations between the domains all 
being of the general order of .50 or .60. This general positive level of corre- 
lation, termed g in orthodox factor analysis, means that the domains of 
space, verbal, speed, and memory appear mutually to embrace to some 
extent the same welter of causes of individual differences among the 
subjects. Contrast this finding with that in the social-area problem, where 
from Fig. 4.2 it is obvious that whereas the domains of socioeconomic level 
and family life appear to be expressions of independent causal matrices, 
the socioeconomic and assimilation domains do, on the other hand, seem 
to share some common elements or source of differences among 
neighborhoods. 


Chapter 5 


COMMUNALITIES OF THE VARIABLES 


The communality of a variable is a number between .00 and 1.00 that 
measures the generality of individual differences in the variable. 
The correlation profiles of the five verbal tests of the Holzinger study in 
Fig. 4.1 indicate that their communalities should be rather high because 
of the relatively high correlations between the variables and because of 
their generally positive correlation with all the other variables of the study. 
This pattern of correlations means that how any one of these tests ranks 
individuals is general; i.e., the ordering of individuals on the variable 
resembles the ordering on the other four verbal tests. The communalities 
of these verbal tests are of the order .70, except for the one test, word 
classification, which has a lower correlation profile than the others, on 
the order of .50. 

The fact that these communalities are not 1.00 signifies that the five 
tests sample other kinds of variation besides the common verbal domain 
they jointly sample. The gap between .70 and 1.00, namely, .30, represents 
the “uniqueness” of each, i.e., variation that is not common to them. 
About three-tenths of the variation among individuals is not shared with 
any of the other variables but is a unique property of each; about five-tenths 
for the word classification test. 


Definition of the communality of a variable 


There are two alternative definitions of the communality $h^2$ of an observed 
variable. Both definitions are useful. In order to define its communality 
we hypothesize that the observed variable is only one of many possible 
definers of a collinear domain of variables, like the different definers of 
each of the four clusters of the Holzinger study. Specifically, the domain 
score $D_i$ for the observed variable $V_i$ is the hypothetical composite 

$$D_i = V_i + V_i' + V_i'' + \cdots \qquad (5.1)$$

where superscripted V's are hypothetical values on many other variables 
collinear with the observed variable. Thus, a domain score is a hypothetical 
construct, but the figures of the last chapter demonstrate that 
domains of collinear variables are not only possible but common. 


Communality as the common variance, 
or predictable variance, of a variable 


The first definition of communality is the squared correlation of the vari- 
able and the domain of the variable 


№ = orig, (5.2) 
The square of a correlation between any two variables is called the ''index 
of determination” of one of the variables by the other; it is the proportion 
of variance of one of them predicted from the other. Therefore, by Eq. (5.2), 
the communality is that portion of the variance in the observed variable 
which can be predicted from the general domain. In the case of the verbal 
tests, the communality of .70 means therefore that about 70 percent of 
their observed variance is general or common variance, a variance shared 
with, and predictable from, a general attribute sampled by other variables. 
The variance of the verbal tests that is not common, i.e., residual, is about 
30 percent of their observed variances. It is symbolized as u², which is

    u_i^2 = 1 - h_i^2        (5.3)


This so-called “unique variance” is to be explained by components that 
are not shared by any of the other variables of the study. 
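
As an illustration of Eqs. (5.1) to (5.3), the following minimal sketch builds a small domain composite and takes the squared correlation of one variable with it. The scores, the choice of only three collinear variables, and the helper names `pearson_r` and `domain_score` are hypothetical; a real domain is conceived as a composite of many collinear variables, so the numbers here only show the arithmetic.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def domain_score(variables):
    """Eq. (5.1): sum each individual's scores over the collinear variables."""
    return [sum(scores) for scores in zip(*variables)]

# Hypothetical scores of six individuals on an observed variable (v)
# and on two other variables assumed to be collinear with it.
v = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
v1 = [1.0, 5.0, 4.0, 8.0, 7.0, 11.0]
v2 = [3.0, 3.0, 6.0, 6.0, 9.0, 9.0]

d = domain_score([v, v1, v2])   # domain composite D
h2 = pearson_r(v, d) ** 2       # Eq. (5.2): communality
u2 = 1.0 - h2                   # Eq. (5.3): unique variance

print(f"h2 = {h2:.3f}, u2 = {u2:.3f}")
```

With only three variables in the composite, the observed variable makes up a large part of its own domain score, which inflates the correlation; the definition assumes a domain broad enough for that effect to be negligible.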


Communality as a correlation coefficient 
with a single collinear variable 
The second definition of the communality of a variable is more down to 
earth. It is a special kind of correlation coefficient, namely, the correlation 
of the observed variable with another set of scores on a second
hypothetical variable V' that is exactly collinear with V. Thus,

    h_i^2 = r_{V_i V_i'}        (5.4)

where the correlations of the second variable V' are perfectly collinear
with the observed array of correlations.

Table 5.1 demonstrates with data from the Holzinger, social-area, and
voting-attitude studies that variables commonly do have at least one other
"reference" variable in the study with which they are highly collinear. The
column headed Holzinger, for example, gives the relevant facts for each
of the full set of 24 variables of the Holzinger problem. Opposite the ordinal
number of each of these 24 variables is shown its highest P² value with
any one of the other 23 variables. For example, the highest P² of variable 1
with another variable is .97. Note that for every one of the 24 variables the
highest P² is .90 or more. The average, shown at the foot of the table, is
.95. And similarly, for the full set of 33 variables in the social-area study,
their highest collinearities average .93.


Where and how communalities are used in cluster and factor analysis 


The need for an accurate determination of the communality of each variable
in cluster or factor analysis can be demonstrated by a brief summary
of the places in which it is used.


TABLE 5.1 IN THREE STUDIES, THE HIGHEST VALUE OF COLLINEARITY P² OF
EACH VARIABLE WITH A REFERENCE VARIABLE

                   Study                                Study
Variable  Holzinger  Social  Voting    Variable  Holzinger  Social  Voting
    1        .97      .99     .94         18        .91      .97     .97
    2        .96      .97     .95         19        .91      .97     .95
    3        .94      .97     .95         20        .97      .97     .95
    4        .97      .98     .93         21        .95      .97     .95
    5        .98      .95     .99         22        .97      .95     .98
    6        .99      .97     .94         23        .97      .96     .98
    7        .98      .92     .92         24        .93      .95     .98
    8        .97      .97     .93         25                 .94     .98
    9        .99      .92     .98         26                 .99     .93
   10        .90      .92     .91         27                 .93     .96
   11        .93      .89     .97         28                 .94     .98
   12        .90      .93     .93         29                 .92     .99
   13        .95      .97     .95         30                 .91     .99
   14        .94      .95     .98         31                 .73     .96
   15        .94      .96     .90         32                 .68
   16        .94      .96     .94         33                 .81
   17        .91      .92     .93        Mean       .95      .93     .95


Communality as an index of 
the generality of a variable 


Since communality describes the generality of individual differences in a
variable, and since the main objective is to form composites of variables 
that measure general ways of ranking individual differences, it becomes 
important to discover from a variable's communality how much the vari- 
able can contribute to the formation of general clusters. In the first place a 
variable's usefulness as a definer in a cluster depends on the size of its 
communality, since its contribution to the size of the domain validity of the 
cluster and to the internal-consistency reliability of the cluster is a sensi- 
tive function of the size of its communality (see Chap. 12). 

The magnitude of the communality of a variable is a function of its 
correlations with other variables of the study, and if these correlations are 
.00, the communality is .00; the variable will have factor coefficients of .00 
with all clusters and factors and should therefore be discarded. In
conducting a study that encompasses a very large number of variables (as in
BIGNV analysis), the first step is therefore to estimate the communalities
of all the variables and delete those which have trivial values. Discarding a
variable means not that it does not have important unique variance but
that it is irrelevant to the objective of cluster and factor analysis, which is 
to discover general attributes, not unique characteristics. 


Communality as an indication of how much variance
is to be described by a cluster or factor analysis


Since our exclusive interest is in general variance, we are really concerned 
only with that portion of variance which is common variance. Thus, when 
the factor coefficient of a variable is low relative to its communality, we
know that other clusters or factors are important correlates of the variable.

The contribution of each cluster or factor to the variance of a variable
is discovered by the process of "augmenting" its factor coefficients, a
matter we consider in Chap. 6 and one that requires knowing the value
of its communality. Suffice it here to say that from the augmented factor 
coefficient we can assess the proportionate degree to which a cluster 
domain or factor determines the common variance. Indeed, it is standard 
practice to deal exclusively with augmented factor coefficients in some 
analyses where one ignores the unique variance of all the variables and 
deals only with this common factor variance. 
Communality as an index of sufficiency
(reproducibility) in factoring


In Chap. 6 we consider the matter of how many clusters or factors are
needed, i.e., the sufficient number of composites to which one may wish
to reduce all the variables. At this point, however, it should be evident that
one would not want more clusters or factors than the number minimally
sufficient to reproduce or account for all the common variances of the
variables. The successive dimensions derived from factoring the
correlation matrix "take out" decreasing proportions of the total variances
of all the variables. Thus, before the factoring procedure is undertaken, one
first estimates the communalities of all the variables and sets up the
overall sum of the communalities across all variables, that is, Σh², as a
criterion of when to stop factoring. To use this criterion, one computes
after each dimension the cumulative proportion of Σh² accounted for up
through that dimension. When a salient portion, say 95 percent or so, of the
estimated communalities has been reproduced, or exhausted, the factoring
is terminated.
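
The stopping rule just described can be sketched as follows. The sketch assumes orthogonal dimensions, so that the communality variance a dimension accounts for is the sum of the squared factor coefficients in its column; the function name `dimensions_needed`, the 4 x 3 table of coefficients, and the communality estimates are all hypothetical.

```python
def dimensions_needed(loadings, communalities, threshold=0.95):
    """Return how many dimensions are needed before the cumulative
    proportion of the summed communalities reaches the threshold,
    together with the cumulative proportions computed on the way.

    loadings[i][k] is the coefficient of variable i on dimension k
    (dimensions assumed orthogonal); communalities[i] is the prior
    estimate of h2 for variable i.
    """
    total = sum(communalities)          # the criterion, sum of h2
    cumulative = 0.0
    proportions = []
    n_dims = len(loadings[0])
    for k in range(n_dims):
        cumulative += sum(row[k] ** 2 for row in loadings)
        proportions.append(cumulative / total)
        if cumulative / total >= threshold:
            return k + 1, proportions
    return n_dims, proportions

# Hypothetical coefficients of four variables on three dimensions.
loadings = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
# Hypothetical communality estimates (here chosen equal to the row
# sums of squared coefficients, so three dimensions exhaust them).
communalities = [0.66, 0.54, 0.66, 0.54]

k, props = dimensions_needed(loadings, communalities)
print(k, [round(p, 3) for p in props])
```

In this made-up case the third dimension would add under 2 percent of the summed communalities, so factoring stops after the second.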
Communality as a diagonal value
in the correlation matrix


For each variable the correlation matrix gives the array of its correlations
with the other n - 1 variables. The cell on the diagonal should contain the
variable's "self-correlation," a term that does not mean the correlation of
the scores with a duplicate series but its correlation with another variable
that is somehow a replica of it. Such another variable is one exactly
collinear with it, a variable that samples exactly the same general attributes
that it does. That correlation is, by definition, r_{V_i V_i'}, or its
communality by Eq. (5.4).


Estimating the communality of a variable


Like other parameters, such as population means, standard deviations,
and correlation coefficients, direct calculation of the communality of a
variable V_i is not possible. Since we do not have its domain score D_i on
individuals, we cannot compute the communality by Eq. (5.2). Nor do we
have a perfect second construct variable V' that is exactly collinear with
it; therefore we cannot accurately compute the communality by the second
equation, (5.4). The best we can do is to compute an approximation.

There has been a long disputatious history in factor analysis over the
"problem" of how to estimate the "unknown communalities" (Harman,
1967, chap. 5; Thurstone, 1947, chap. 13). But since the concept is quite
unambiguous on domain-sampling principles (Tryon, 19575), Eqs. (5.2) and 
(5.4), the matter reduces simply to using methods of solving for the com- 
munality estimates that come closest to satisfying either Eq. (5.2) or (5.4). 

No attempt is made here to present the detailed computing formulas 
used in estimating communalities. The derivative logic and formulas are 
presented in Chap. 12. The intention here is to show only how closely the 
estimates approximate good values. 

There are 11 different methods of securing approximations to the 
communalities by using programs of the BC TRY System. They are sum- 
marized in Table 5.2, where the communalities of the 24 variables of the 
Holzinger study as approximated by each method are listed. The 11 
methods are grouped into four main classes: those which compute esti- 
mates from (1) a single reference variable, from (2) subsets of reference 
variables, from (3) the full set of all the other n — 1 variables of the study, 
and (4) from the factoring procedures described in Chap. 6. 

The most obvious conclusion to be drawn from an inspection of 
Table 5.2 is that whichever method is used, the value of the estimate of 
the communality of a variable is much the same. One may well ask: What 
has all the controversy been about? The main reason for it is that until the 
methods could be programmed on the computer, few students of the 
problem had the patience and budget to try out all the methods on a wide 
variety of different problems to see how closely they matched. To compute 
and carefully check the results in Table 5.2 would have taken weeks on a 
desk calculator: it now takes less than 6 min on the computer. 

Now let us look at Table 5.2 with the aim of finally deciding which
method gives the best approximations to the communalities. In order to
make that decision we need to know the "true" communalities of the
variables. With them at hand, we can find out which method gives the closest
unbiased estimates of the true values. Actually, with real data there are 
no such values as the “true” communalities of the variables—except in 
artificial problems (see Tryon, 19575). The situation is like asking: What is 
the “true” upper threshold of sensing auditory intensities? Even if the 
most elaborate, highly controlled instrumentation were used to measure a 
subject's upper limit of hearing on different experimental sessions, the 
results would vary. However, we might take the “true” limit to be the 
average of the separate limits. Similarly, we take as the "true"
communality of each of the 24 tests its average value over the 10
independent estimates (the method called "MOD B" is not included because
it is itself an average of two other estimates in the table).

The average communalities are listed in the last column of Table 5.2.
The decision of which of the 11 methods gives the closest approximation to
the true values is based on discovering the degree to which each column
of estimates in the table matches the average values in the last column.
It is not sufficient simply to find the product moment correlation coefficient
between each column of estimates and the column of true values because
if the relation were in fact curvilinear, the correlation would underestimate
the relationship. Furthermore, for a given method it is possible for the
correlation to be close to 1.00, and the absolute matching of estimated
values with the true values could be seriously biased, e.g., by being
constant underestimates or overestimates of the average values or by
systematic compensatory underestimates at one level of average values and
overestimates at another. We need a method of assessing not only the
degree of correlation between estimates and the average values but also
the extent of constant or differential bias in the estimates. Such information
is revealed (1) by computing and comparing the Pearson correlation
coefficient r and the correlation ratio η in order to assess linearity and (2), to
assess bias, by calculating the regression line of estimated values on
average values. Furthermore, we would not want to put all our eggs in the
Holzinger-study basket but should include in this appraisal the findings
from the other two studies.

The results of testing each method to determine the degree of its 
matching of estimates to average values for the Holzinger, social-area, and 
voting-attitude studies are condensed in Table 5.3. Each value in Table 5.3 
is the average of three values from separate analyses of the three groups 
(see the technical note below). 

First of all, note that the 11 methods are bracketed under the four
main classes, referring to the major kinds of computing formulas designed
to estimate communalities, as in Table 5.2. For example, the first one
includes two methods of estimating communality from a single reference
variable, clearly referring to Eq. (5.4), namely, estimating h² from
r_{V_i V_i'}, its correlation with a second variable collinear with it. The
second class includes four methods of estimation from a subset of the most
collinear reference variables. In this case, the intent is to designate a
domain D_i of collinear variables from which a variable's communality can be
estimated by r²_{V_i D_i} from Eq. (5.2). The third group, of two methods,
also stems from Eq. (5.2), the intention being, by multiple correlation
procedures, to devise a best-weighted composite of all n - 1 variables that
will best predict a collinear domain from which pari passu the variable's
communality can be estimated under Eq. (5.2). The fourth group, of three
methods, is an interesting development from Eq. (5.4). For each variable a
second hypothetical collinear variable V' is constructed such that it has
exactly the same factor coefficients on each of the factored dimensions as
its mate, whence its communality is solved for directly by Eq. (5.4), that
is, r_{V_i V_i'}.


TABLE 5.3 ESTIMATION OF THE AVERAGE COMMUNALITIES OF THE VARIABLES BY
11 DIFFERENT METHODS IN THE HOLZINGER, SOCIAL-AREA, AND VOTING STUDIES

                                        Correlation       Bias in estimated h²:
                                        between the       mean deviation d of
                                        average and       estimated h² from an
Method of estimating the                estimated h²      average h² value of
communality h²                            r      η        .00    .50   1.00   Σ|d|

From a single reference variable:
   1. r with most collinear variable     .80    .87       .16    .03   -.07    .26
   2. Highest r                          .96    .97       .17    .06   -.05    .28
From a subset of the most collinear
variables:
   3. Approximation B                    .94    [?]      -.15   -.07    .00    .22
   4. Modified approximation B           .97    .98       .01   -.01   -.02    .04
   5. Proportional fit on four
      reference variables                .97    .98      -.10   -.05    .00    .25
   6. Proportional fit on nine
      reference variables                .97    .98      -.01    .01    .03    .05
From the n - 1 other variables of
the study:
   7. Squared multiple R                 .95    .96       .18    .08   -.02    .28
   8. Quadratic formula                  .92    .94       .02    .04    .05    .21
From factoring:
   9. Key cluster                        .96    .97      -.13   -.04    .03    .20
  10. Centroid                           .97    .98      -.06   -.02    .01    .09
  11. Principal axes                     .98    .98      -.07   -.02    .02    .11

For each method reported in Table 5.3 the first two columns give the
average correlation between its estimates and the average values over the
three problems. Except for method 1, the linear correlations are above .90,
most being around .95 or higher. There is no curvilinear relation between
estimates and average values; the values for the correlation ratio η are
only trivially higher than their corresponding linear correlations. In sum,
each method except the first produces ranked communalities of the
variables closely corresponding to the rank orders of the average
communalities.

The story is somewhat different with respect to bias, given in the
last four columns of Table 5.3. Looking at the first column of biases, the
case where the average communality is the value .00, we see that methods
1, 2, and 7 give communalities whose average deviation d from the average
value of .00 is at least .16. These are methods that overestimate
communalities. Methods that show negative deviations are those which
underestimate the average communalities. In the second and third columns
of biases, for average communalities of .50 and 1.00, over- and
underestimation is generally less. Finally, the last column gives a numerical
statement of overall bias, consisting of the sum of deviations at .00, .50,
and 1.00, disregarding sign. Without going into detail, note that methods
1, 2, 3, 5, 7, etc., do poorly overall, but two of the methods, 4 and 6, are
virtually without bias at any of the three levels of communality, low,
middle, or high.

The method that wins the competition is the one that produces com- 
munality estimates that match best, in absolute magnitudes, the average 
communalities. There are, in fact, two of them: method 4, modified 
approximation B, or MOD B, and method 6, proportional fit on nine refer- 
ence variables, or PF(9). Of these two, MOD B is superior in the practical 
sense of taking much less computer time to compute than PF, a matter 
of some moment in problems, say, involving 120 variables. Note that both 
methods produce estimates that correlate .97 with the average values but 
in addition are almost without bias, giving absolute values almost
identical with the average. Method 2, highest r, is excellent for ranking
variables by communality, but to secure good absolute values requires
correcting for a severe overestimate in the lower ranges of communalities.


Technical note: method of computing relations between estimates and average 
communalities by RSCAT 


The values of Table 5.3 are averages computed from the 11 correlation 
scattergrams of each group. These scattergrams give the relationship 
between the average values and the 11 estimates for each group. For 
example, in the Holzinger study the estimates and average values, as given 
in Table 5.2, were input as a data matrix to DAP2, the columns being 
treated as variables, rows as objects. The correlation-scattergram
program, RSCAT, was then called, wherein the option was taken to compute
all scattergrams between a "common variable," the average values, and
the 11 other variables. Each scattergram gives both r and η, and the
constants of the regression line of average values on the estimates, namely,
the slope b and the intercept a at the value of .00. Thus, a is the estimate
at a value of .00, .50b + a at the value of .50, and b + a at the value of
1.00. These values are those given in Table 5.3. Both the linear and
curvilinear regression lines themselves are provided in RSCAT, thus also
permitting a visual check on linearity.
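
The regression arithmetic of this note can be sketched in a few lines. This is not the RSCAT program; it is an ordinary least-squares fit of the estimates on the average values, with the deviation d at each level taken (as an assumption about how the table entries were formed) to be the fitted estimate minus the level itself. The four pairs of values and the names `fit_line` and `bias_profile` are hypothetical.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares line of y on x: returns slope b, intercept a, and r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    b = sxy / sxx
    a = my - b * mx
    return b, a, sxy / sqrt(sxx * syy)

def bias_profile(average, estimates, levels=(0.0, 0.5, 1.0)):
    """Correlation, deviation d at each level, and the sum of |d|."""
    b, a, r = fit_line(average, estimates)
    # Fitted estimate at each level minus the level (assumed meaning of d).
    deviations = [(a + b * level) - level for level in levels]
    return r, deviations, sum(abs(d) for d in deviations)

# Hypothetical average communalities and one method's estimates of them.
average = [0.2, 0.4, 0.6, 0.8]
estimates = [0.30, 0.45, 0.60, 0.75]

r, deviations, total_bias = bias_profile(average, estimates)
print(round(r, 3), [round(d, 3) for d in deviations], round(total_bias, 3))
```

The made-up method here ranks the variables perfectly (r of 1.00) yet overestimates low communalities and underestimates high ones, which is the kind of compensatory bias a correlation alone would not reveal.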
DISCOVERING SALIENT GENERAL DIMENSIONS BY
KEY-CLUSTER FACTORING


In this chapter we discuss means of discovering collinear clusters among
a set of variables. The methods are purely objective, applicable by way
of a computer procedure but capable of discovering the well-defined,
meaningful clusters introduced in Chap. 4.

There are three primary objectives in key-cluster analysis: (1) to
discover from the variables of a study the mutually collinear variables
defining each of the clusters; (2) to discover the minimal or salient number
k of dimensions sufficient to reproduce all the intercorrelations among
variables and the communalities of the variables; and (3) when there are
more clusters of variables than the minimally sufficient number, to provide
a means with which to select the most nearly independent (uncorrelated)
clusters on which to score individuals in object cluster analysis, i.e.,
those that will maximally differentiate the score profiles among the
individuals.

Key-cluster factoring: a special case of the general principles of factoring
(independent dimensional analysis)


The term "factor analysis" comes from the procedures of factoring, of
which there are many variants. The different types of factoring are,
however, all special cases of the general method of independent dimensional
analysis (Tryon, 1959). Procedurally, in principle, the simple logic of the
general method is as follows. Starting with the matrix of the correlations
of the variables, with the communalities in the principal diagonal cells, the
first dimension, or factor, is defined as a weighted composite of the
variables. This composite is taken to be either a subset of them or the total
set of them with a designated pattern of weights attached to the variables.
Then, by a special form of adjustment to be described later, scores on this 
first factor composite are partialed out of each of the correlations in the 
matrix. The resulting matrix is called the ''first factor residual matrix." A 
second dimension is then defined by a newly weighted composite of the 
variables, scores on which are partialed out of the first residual matrix, 
the result of which forms a second residual matrix. The procedure is con- 
tinued until a final residual matrix is formed of entries that are of trivial 
magnitudes. 

A matrix of trivial residuals is one in which the values are so small
that they are not considered worth defining a new dimension on. In most 
problems there is not much difficulty in deciding on when to stop factoring 
by inspecting the residuals. Thus, in the Holzinger problem, after four 
dimensions are extracted, even the largest of the fourth factor residuals 
would not lead to a fifth dimension on which one would reasonably wish 
to score the subjects. 

Another indication of how many dimensions to retain is simpler than 
scanning a large matrix of residuals: the proportion of the communalities 
accountable from scores on the dimensions provides a criterion of the 
sufficiency of the dimensions. The communalities represent the amount 
of common variance among the variables. As factoring proceeds, the
portion of the variables' communalities accounted for by each dimension is
computed. When a salient amount, say around 95 percent, is accounted
for, it is usually not fruitful to continue factoring because any additional
single dimensions would account for less than about 5 percent of the
communalities of the variables. Such a dimension would have very narrow
generality. The key-cluster factoring program CC5 of the BC TRY System,
called cumulative communality key-cluster analysis or, more simply, CC
analysis, keeps track of this statistic. It has been programmed to quit
factoring when 97.5 percent of the sum of all estimated communalities has
been exhausted (although this criterion is under the control of the user
of the program).

Key-cluster factoring is a special case of general factoring principles.
Each dimension in key-cluster factoring is defined by a collinear subset of
variables. Well-known other special cases (to be treated in a later chapter)
are centroid and principal-axes factoring, which are alike in defining each
dimension by the total set of all n variables with a specially selected pattern
of weights attached to them. The methods differ from each other only
with respect to how the weights are computed: centroid by simple sums,
principal-axes by least-squares weighting. One of the oldest special cases
is "two-factor" or "bifactor" analysis, in which the first dimension is
defined by the total set of variables whereas the later residual dimensions
are defined by subsets. Still another special case is that in which each
dimension is defined by a single variable, a form variously known as
"square root," "diagonal," or "pivot-variable" factoring. Confusion
caused by this plethora of factoring methods can be avoided by
remembering they are all merely special cases of the general principles stated
above: they are simply variations on the same general theme.
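
To make the cycle of "define a dimension, partial it out, inspect the residuals" concrete, here is a minimal sketch of one step of the simplest special case named above, diagonal ("square root") factoring on a single pivot variable. It uses the standard textbook formulas for that method (coefficient of variable j equals its correlation with the pivot divided by the square root of the pivot's communality; residual equals the correlation minus the product of the two coefficients), not the key-cluster adjustment, which the text defers. The function name and the 3 x 3 matrix are hypothetical.

```python
from math import sqrt

def extract_pivot_dimension(R, p):
    """One step of diagonal factoring on pivot variable p.

    R is a symmetric correlation matrix (list of lists) with
    communalities in the diagonal cells. Returns the factor
    coefficients on the new dimension and the residual matrix.
    """
    n = len(R)
    root = sqrt(R[p][p])                      # sqrt of pivot's communality
    coeffs = [R[j][p] / root for j in range(n)]
    residual = [
        [R[j][k] - coeffs[j] * coeffs[k] for k in range(n)]
        for j in range(n)
    ]
    return coeffs, residual

# Hypothetical matrix generated by one dimension with coefficients
# .8, .7, .1; the diagonal cells hold the communalities .64, .49, .01.
R = [
    [0.64, 0.56, 0.08],
    [0.56, 0.49, 0.07],
    [0.08, 0.07, 0.01],
]

coeffs, residual = extract_pivot_dimension(R, p=0)
largest = max(abs(x) for row in residual for x in row)
print([round(c, 2) for c in coeffs], largest < 1e-12)
```

Because the made-up matrix was built from a single dimension, the first residual matrix is already trivial; with real data the step would be repeated on the residuals until they are too small to define another dimension.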


The decision pattern of key-cluster factoring 


The general method of multidimensional analysis can be thought of as
consisting of eight "regions of decision." It is the pattern of special
decisions across these eight regions that delineates a particular factoring
method. We illustrate here the key-cluster pattern, making it concrete by
showing the application of it to the Holzinger study. To demonstrate clearly
how widely applicable key-cluster factoring is, we will show how it works
out on the social-area and voting-attitude studies.


The Holzinger study 


Reproduction of intercorrelations


The sufficiency of the factored dimensions is revealed, in one aspect, by
how well they reproduce the correlation matrix. They are deemed sufficient
as soon as a residual matrix contains trivial values. When the correlation
of two variables is accounted for by a set of factor or cluster dimensions,
i.e., reproducible in terms of those dimensions, the residual correlations
after those dimensions are removed from the matrix will be zero. Just how
residuals are specifically formulated will be considered later, but at this
point we can assess how well the original correlations of the Holzinger
problem are reproduced by noting the size of residuals after successive
dimensions. These are shown in Table 6.1, Sec. B. The overall statistic that
describes the average magnitude of the elements in a correlation matrix
is the root mean square (RMS), secured by squaring each element in the
matrix and finding the square root of the mean of all these squares. The
RMS may be thought of as a slightly upward-biased estimate of the average
of all the values, disregarding algebraic sign.


TABLE 6.1 SUFFICIENCY OF KEY-CLUSTER DIMENSIONS (FACTORS)
IN THE HOLZINGER STUDY OF ABILITIES

A. Reproduction of Communalities during First Factoring

                                       Verbal   Speed   Spatial   Memory
                                         C1       C2       C3       C4
Proportion of sum of estimated
  communalities, each dimension          .49      .23      .16      .10
Cumulative proportion of
  communalities from
  dimension C1 to Ck                     .49      .72      .88      .98

B. Reproduction of Original Correlations after Fourth Iterated Solution

Original          Residuals after


In Table 6.1, Sec. B, it is shown that for the 276 original correlations
the RMS is .33. Looking now at the mean size of the elements in the matrix
after extracting the first dimension, C1, we see that the RMS of the first
residuals is .16. The general combined sufficiency of four dimensions is
revealed by the fact that after the fourth dimension, C4, the fourth
residuals have an RMS of only .04. This overall triviality does not
necessarily mean that there might not be some residuals somewhat higher than
.04. We consider this possibility later when we look at the distribution of
residuals.
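
The RMS statistic is easy to compute directly. The sketch below takes the n(n - 1)/2 distinct off-diagonal entries of a symmetric matrix (for 24 variables, the 276 correlations mentioned above) and returns their root mean square; the function name and the three correlations in the example are hypothetical.

```python
from math import sqrt

def rms_offdiagonal(R):
    """Root mean square of the distinct off-diagonal entries of a
    symmetric matrix, plus the number of entries used."""
    n = len(R)
    values = [R[i][j] for i in range(n) for j in range(i + 1, n)]
    return sqrt(sum(v * v for v in values) / len(values)), len(values)

# Hypothetical 3 x 3 matrix; only the off-diagonal cells are used.
R = [
    [1.0, 0.6, -0.3],
    [0.6, 1.0, 0.0],
    [-0.3, 0.0, 1.0],
]

rms, count = rms_offdiagonal(R)
mean_abs = (0.6 + 0.3 + 0.0) / 3      # average disregarding sign
print(count, round(rms, 3), round(mean_abs, 3))
```

The RMS can never fall below the mean absolute value, which is the sense in which it is an upward-biased estimate of the average magnitude.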


Reproduction of communalities 


In Chap. 5 we emphasized that the main indicator of the generality of 
individual differences in each variable is its communality, namely, the 
portion of its variance that is predictable from the other variables of the 
study. Before factoring the matrix, we can make an estimate of the com- 
munalities of the variables and use the sum of these estimates as a target, 
an amount to be reproduced by a sufficient number of dimensions. As 
we take out one dimension at a time, we calculate the proportion of this 
total amount of variance that has been accounted for, up to and including 
each dimension, and stop factoring when no salient amount is left to factor. 

The results are given in Table 6.1, Sec. A. The first dimension, C1,
defined as "verbal," takes out about half (49 percent) of the overall
communalities, and the fourth, C4, takes out only 10 percent. The next row of
the table, giving the cumulative communalities accounted for up to each 
dimension, shows that by the fourth dimension 98 percent of the com- 
munalities has been accounted for and only 2 percent is left over for any 
additional dimensions. Such a residual communality seems hardly worth 
devoting one or more additional dimensions to. 


The defining variables of each dimension 


In key-cluster factoring we choose as definers of each of the dimensions 
the most collinear subset of variables that is also most nearly independent 
of the definers of other dimensions. Since the derivative basis of the 
objective procedures for locating such a cluster is given in Chap. 12, we 
sketch here only the simple logic of the procedures. 

To locate the definers of the first dimension that are likely to be
most nearly independent of succeeding clusters, we first look for a defining
variable, called the "pivot variable" of the clusters, that meets the following
conditions: (1) it must be relatively highly correlated with some other
variables, in particular with additional definers-to-be of the cluster, and (2)
at the same time it must tend to be uncorrelated with other variables, in
particular, with definers-to-be of other clusters. Clearly such a pivot
variable is one whose column in the correlation matrix has both high values
in it and low values in it; in short, the variance of the absolute magnitudes
or squares of its correlations is relatively large. From this reasoning, we
therefore set up as an index of "pivotness" the variance of the squared
correlations in each column of the correlation matrix. We arbitrarily accept
that variable with the highest variance as the pivot variable around which to
build the cluster that will define the first dimension.
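
The index of pivotness can be sketched as below. Two details are assumptions not settled by the text: the diagonal cell is left out of each column, and the variance is the population form (divided by the number of correlations). The function name and the 4 x 4 matrix are hypothetical.

```python
def pivotness(R):
    """Variance of the squared correlations in each column of R,
    omitting the diagonal cell (an assumption of this sketch)."""
    n = len(R)
    index = []
    for j in range(n):
        squares = [R[i][j] ** 2 for i in range(n) if i != j]
        mean = sum(squares) / len(squares)
        index.append(sum((s - mean) ** 2 for s in squares) / len(squares))
    return index

# Hypothetical correlations: variables 0 and 1 form one tight pair,
# variables 2 and 3 a looser pair, with small cross-correlations.
R = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]

index = pivotness(R)
pivot = max(range(len(index)), key=lambda j: index[j])
print([round(v, 4) for v in index], pivot)
```

Variable 0 is chosen because its column mixes one high correlation with near-zero ones, exactly the pattern the two conditions describe.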

Which additional variables should be included in the defining cluster
along with the pivot variable? Obviously they should be those variables
that are most collinear with it and with each other, as measured by their
indexes of collinearity P². An objective criterion that has been developed
and programmed into BC TRY sets three conditions for accepting a
variable as a definer of the cluster: (1) it must show a relatively high average
degree of collinearity with definers already accepted, a condition met only
if the mean of its P² values with them is high, say not less than .81; (2) it
must show substantial collinearity with each one of them, a condition met
only if all its P² values are above a lower bound, say .40; and (3) the
addition of it must preserve the "tightness" of the cluster, a condition met only
if all its P² values with them lie within, say, twice the range of two P²'s
of the pivot variable, namely, the value of the difference between its P²
with the first definer added to the cluster and its P² with the last previous
definer added. Note that there are three standard arbitrary bounds built
into this programmed criterion of mutual collinearity. If a user of the BC
TRY programs wants to set higher or lower bounds than the standard
values, he can always opt to do so in using the programs.

At times the arbitrary index of pivotness may select a pivot variable 
for which one cannot find additional definers that meet the conditions 
of mutual collinearity. We have therefore built into our factoring program 
a trial-and-error search routine that looks for a pivot that can form a 
collinear cluster if one exists. This search procedure works as follows. If a
given selected pivot fails to pick up additional definers, it is rejected. A new 
pivot is then selected, the variable with the second highest value of the 
pivotness index. If this variable should also fail, a third is tried out, and 
even a fourth. Four trials are taken to be enough, and if the fourth effort 
to find a pivot variable fails, no further search is made. 
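
The trial-and-error search reads naturally as a short loop. The sketch below is not the BC TRY routine; it only mirrors the logic described: candidates are tried in descending order of the pivotness index, at most four times, and the first one that picks up additional definers is kept. The callable `find_definers`, which stands in for the mutual-collinearity criterion, and the example values are hypothetical.

```python
def find_pivot(pivot_index, find_definers, max_trials=4):
    """Try candidate pivots in descending order of pivotness.

    pivot_index[j] is the pivotness of variable j; find_definers(j)
    returns the list of additional definers that variable j picks up
    (empty if none). Returns (pivot, definers), or (None, []) when
    all trials fail.
    """
    order = sorted(range(len(pivot_index)),
                   key=lambda j: pivot_index[j], reverse=True)
    for candidate in order[:max_trials]:
        definers = find_definers(candidate)
        if definers:
            return candidate, definers
    return None, []

# Hypothetical pivotness values for five variables.
index = [0.090, 0.084, 0.011, 0.013, 0.050]

# Hypothetical criterion: variable 0 finds no definers, variable 1 does.
accepted = {1: [0, 4]}
pivot, definers = find_pivot(index, lambda j: accepted.get(j, []))
print(pivot, definers)

# A case in which every trial fails.
failed = find_pivot(index, lambda j: [])
print(failed)
```

Capping the search at four trials follows the text; the cap is exposed as a parameter only for convenience in experimenting.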

TABLE 6.2 DEFINING VARIABLES OF KEY-CLUSTER DIMENSIONS (FACTORS)
IN THE HOLZINGER STUDY OF ABILITIESᵃ

                 Orthogonal
Variable         Fc       h²      r̄

C₁, Verbal:
  V9   Wmn       .86      .76     .72
  V6   Cmp       .83      .69     .69
  V7   Snt       .83      .69     .69
  V5   Inf       .80      .65     .66
 Other
  V8   Wcl       .68      .49     .37

C₂, Speed:
  S10  Add       .79      .76     .59
  S12  Cnt       .70      .56     .49
  N24  Ari       .46      .48     .46
  S11  Cod       .50      .42     .43
 Other
  S13  Scc       .53      .54     .46

C₃, Spatial:
  F4   Loz       .64      .58     .46
  F1   Vis       .44      .38     .38
  F2   Cub       .47      .30     .33
  N21  Puz       .30      .46     .38
 Other
  N23  Ser       .39      .50     .42
  N22  Rsn       .30      .39     .35
  N20  Ded       .30      .42     .35
  F3   Fbd       .38      .28     .32

C₄, Memory:
  M14  Wrg       .56      .44     .40
  M15  Nrg       .46      .32     .34
  M17  Wn        .47      .45     .38
  M16  Frg       .43      .48     .38
 Other
  M18  Nf        .26      .43     .33
  M19  Fw        .19      .20     .25

ᵃ Abbreviations of test names are explained in Table 4.1.

How the mutual-collinearity criterion works in the Holzinger problem
is shown in Table 6.2. The definers of the first dimension, C₁, are the
tightly collinear four verbal tests whose correlation profiles are displayed 
in Fig. 4.1. The fifth variable listed, V8, word classification, is not selected 
as a definer but is placed under the rubric "other," denoting that it is a
runner-up as a definer in the sense that its correlation with the first factor
(its factor coefficient, to be discussed below) is the highest it has with any
of the four dimensions. For each of the definers and their runners-up in
Table 6.2, we list three quantities of interest: (1) its factor correlation Fc,
(2) its communality (after factoring) h², and (3) its mean original correlation
with the selected definers r̄. Each of the factor coefficients listed is the
correlation of a variable with its dimension defined as independent of the
other dimensions. (These so-called "orthogonal" coefficients are to be
distinguished from oblique factor correlations, discussed in Chap. 7.)


Reflection of the defining variables of a dimension 


It is not uncommon to discover that two variables selected as definers of a
dimension are negatively correlated. In such cases (the Holzinger data do
not provide an illustration), one of the two variables also correlates
negatively with the other definers. Everything becomes straightened out if
the offending definer is "reflected." Reflection is achieved by reversing the
signs of its correlation coefficients both in its column and its row in the
correlation
matrix. Such an operation means changing the “direction” in which individ- 
uals have been scored on the variable: the individual that earned the high- 
est score as originally scored is, after reflection, given the lowest, and vice 
versa. The result is that the name given the variable must be changed to 
its antonym or logical negative when interpreting its correlation coefficients 
with changed signs. 

Sometimes reflection operations can get complicated when more than 
one definer needs to be reflected. Here is a real advantage of the computer 
as a high-speed clerk: it can execute the optimal reflection in seconds. 
This procedure is described in detail in Chap. 12. In general, optimal reflec- 
tion means that all possible patterns of reflection of the defining variables 
(up to 10 definers) are tried out; for each pattern the sum of the correla- 
tions between the definers is computed. That particular reflection which 
yields a maximal sum of correlations between definers, as reflected, is 
finally retained. The result is that the definers are so reflected that one can 
be assured that the new directions in which the reflected variables are to 
be scored in a cluster score composite on individuals provide a cluster 
score that is rationally the most meaningful one possible. 
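
A brute-force version of optimal reflection is short enough to sketch here. It is an illustration only, not the BC TRY routine described in Chap. 12; the function name is hypothetical. One shortcut is taken that the text does not mention: the first definer's sign is held fixed, since reflecting every definer at once leaves the sum of correlations unchanged.

```python
from itertools import product


def optimal_reflection(r):
    """Find the sign pattern maximizing the sum of definer correlations.

    r: square list of lists holding the correlations among the definers
    of one dimension. Returns (signs, total), where signs[i] is +1 (keep)
    or -1 (reflect) and total is the sum over all pairs after reflection.
    """
    k = len(r)
    if k > 10:
        raise ValueError("exhaustive search is limited to 10 definers")
    best_signs, best_total = None, float("-inf")
    # Hold the first sign at +1: flipping all signs gives the same sum.
    for rest in product((1, -1), repeat=k - 1):
        signs = (1,) + rest
        total = sum(signs[i] * signs[j] * r[i][j]
                    for i in range(k) for j in range(i + 1, k))
        if total > best_total:
            best_signs, best_total = signs, total
    return best_signs, best_total
```

With 10 definers the loop visits 512 patterns, which is the kind of clerical work the text assigns to the computer.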


Factor coefficients and partial communalities


The most important basic statistics computed by the factoring procedure 
are the factor coefficients (or factor correlations) of all the variables on all 
the dimensions. A factor coefficient of a variable is simply its estimated 
correlation with a hypothetical score on the dimension. For example, from 
Table 6.2, the factor coefficient .86 of V9, word meaning, on the first
dimension, C₁, means that scores of individuals on this vocabulary test would
correlate about .86 with a domain score on the first dimension, i.e., with a 
composite score on an indefinitely large number of tests perfectly collinear 
with the four variables that have been selected to define this first dimen- 
sion. (For the logic of the factor coefficient, see Chap. 12.) 

Table 6.3 gives the complete listing of the factor coefficients of all the 
24 ability tests with the four dimensions. For example, V9, word meaning, 
correlates, as we saw above, .86 with the first dimension, of which it is a 
definer (see the Definers column in Table 6.3); with the remaining three 
dimensions its factor coefficients are trivial, namely, —.07, —.01, and .03. 

Inspecting the values of the factor coefficients in the table gives one 
a notion of the weight of each of the dimensions in determining individual 
differences in each of the variables. A factor correlation is itself not the 
best index of the determination of individual differences by a dimension 
(or factor): the square of the factor correlation is the best estimate of the 
degree of variance predictable from the dimension. This squared factor 
coefficient is called the "partial communality" of the variable on the
dimension. If, for example, one squares each of the four factor coefficients of
word meaning, the resulting numbers give a more accurate notion of the 
variance in word meaning test scores determined by the successive dimen- 
sions than the unsquared coefficients would. Since the dimensions are 
independent, the factor coefficients are also multiple regression coeffi- 
cients of the variable on the dimensions conceptualized as multiple pre- 
dictors; the factoring procedures are merely a special case of multiple 
correlation and prediction. 

The point here, however, is that the cumulated partial communalities 
of a variable on successive dimensions show the predictability (the squared 
multiple correlation) from the dimensions conceptualized as a battery of 
predictors. The cumulative partial communalities are given in Table 6.3 in 
the columns under the general heading of Communalities. Inspecting these 
values at any point in the factoring procedure shows how well the variance 
of each of the variables is predicted from the dimensions. The cumulative 
values on the last, or fourth, dimension, C₄, are, in fact, the final estimates
of the communalities of the variables from factoring, one type of communality
estimate described in Chap. 5. This last column of values corresponds very
closely to the initial estimates of communalities before factoring, listed in
Table 6.3 in the column headed Communalities Initial.
Indeed, as factoring proceeds, it is the overall sum of the cumulative par- 
tial communalities that is being compared with the sum of the initial esti- 
mates of the communalities. When the sums match up, factoring is 
terminated. 

How much each dimension accounts for the communality of a variable 
can be discovered by dividing its partial communality on each dimension 
by its total communality. The resulting fraction is the proportion of common 
variance in a variable that is predictable from each dimension. For exam- 
ple, note that the partial communality of V9, word meaning, from the first 
dimension is .75 whereas its total communality from all four dimensions 
is only a little higher, .76. Clearly the fraction .75/.76 = .987 means that for
this test the first dimension leaves little common variance for other 
dimensions to account for. 
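
The arithmetic of partial and cumulative communalities can be checked with a short sketch. The four coefficients are those quoted above for V9; the function names are hypothetical. Because the printed coefficients are rounded to two places, the results come out near .74 and .75 rather than the .75 and .76 quoted in the text, which presumably rest on unrounded values.

```python
def partial_communalities(coefficients):
    """Square each orthogonal factor coefficient of one variable."""
    return [c * c for c in coefficients]


def cumulate(values):
    """Running sums: communality accounted for after each dimension."""
    running, out = 0.0, []
    for v in values:
        running += v
        out.append(running)
    return out


# Factor coefficients of V9, word meaning, on the four dimensions.
v9 = [0.86, -0.07, -0.01, 0.03]
partials = partial_communalities(v9)        # about .74, .005, .0001, .001
cumulated = cumulate(partials)              # last entry: total communality
shares = [p / cumulated[-1] for p in partials]  # share of common variance
```

Here `shares[0]` is the fraction of the common variance of V9 that the first dimension accounts for.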

In practice, the factor coefficients are divided by the square root of
the communality. These "corrected" coefficients are called the "augmented"
or "normalized" factor coefficients. The augmented factor
coefficients are, as we see below, the coordinates of each variable that
determine its locus in the geometric configuration as plotted by the BC TRY
component SPAN.
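
As a small illustration of this normalization (a sketch with a hypothetical function name, not the SPAN code), dividing by the root of the communality places the variable at unit distance from the origin whenever the communality equals the sum of the squared coefficients:

```python
import math


def augmented_coefficients(coefficients, communality):
    """Divide each factor coefficient by the square root of h^2."""
    root = math.sqrt(communality)
    return [c / root for c in coefficients]


# V9, word meaning: coefficients from the text, communality .76.
v9_augmented = augmented_coefficients([0.86, -0.07, -0.01, 0.03], 0.76)
```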


Residuals and reproduced correlations 


Each of the dimensions after the first and the set of variables that defines 
it are determined from a matrix of residual correlations. For example, hav- 
ing extracted the amount of correlation between the variables that is 
accounted for by the first dimension, the verbal factor, we have a matrix 
of first factor residuals left over, in which the definers of the second 
dimension, the speed factor, are located. By way of illustration, take the 
case of the residual between the first two ability tests, F1, visual figure
completion, and F2, cube similarities. The first factor coefficients with the
verbal dimension are .39 and .26, respectively, as shown in Table 6.3. The
correlation between these two tests as produced by the verbal dimension is
calculated by the general formula for the reproduced correlation, called the
"cross-product" or "fundamental factor theorem" (see Chap. 12). The
general formula for the reproduced correlation of variables $V_j$ and $V_k$
due to the $i$th dimension, with domain $D_i$, is

$$\hat{r}_{V_j V_k (D_i)} = r_{V_j D_i}\, r_{V_k D_i} \tag{6.1}$$

For example, for $V_1$, $V_2$, and the first dimension in the Holzinger study,
Table 6.3,

$$\hat{r}_{V_1 V_2 (D_1)} = (.39)(.26) = .10$$

The residual correlation between $V_j$ and $V_k$, without the influence of
$D_i$, is the original correlation minus the reproduced correlation:

$$r_{V_j V_k \cdot D_i} = r_{V_j V_k} - \hat{r}_{V_j V_k (D_i)} \tag{6.2}$$

Thus

$$r_{V_1 V_2 \cdot D_1} = .33 - .10 = .23$$


The matrix of such first residuals therefore represents the amounts
of correlation among all 24 variables after we have extracted the amounts
of variance among the variables due to the verbal factor. It follows that
the speed dimension is independent of, i.e., uncorrelated with, the first
dimension. Since the second and third residuals are similarly derived from
the speed and spatial dimension correlations, the last two dimensions,
spatial and memory, are also independent dimensions.
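
The worked example for the first two tests can be reproduced directly. The sketch below uses hypothetical function names; the three numbers are the ones quoted in the text.

```python
def reproduced_correlation(coef_j, coef_k):
    """Correlation of two variables as produced by one dimension:
    the product of their factor coefficients on that dimension."""
    return coef_j * coef_k


def residual_correlation(original, coef_j, coef_k):
    """Original correlation minus the reproduced correlation."""
    return original - reproduced_correlation(coef_j, coef_k)


# F1 and F2 on the verbal dimension: coefficients .39 and .26,
# original correlation .33.
reproduced = reproduced_correlation(0.39, 0.26)
residual = residual_correlation(0.33, 0.39, 0.26)
```

Rounded to two places, `reproduced` and `residual` agree with the .10 and .23 given above.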

The sufficiency of the final set of dimensions to account for the original
correlations can be appraised from the distribution of residuals after
the final dimension. They must, of course, be quite trivial if the dimensions
are to be considered sufficient. Because the dimensions are independent,
the reproduced correlations for all dimensions are subtracted from the
original correlation, for each pair of variables. For $V_j$, $V_k$, and the
dimension domains $D_1, D_2, \ldots, D_K$,

$$r_{V_j V_k \cdot D_1 D_2 \cdots D_K} = r_{V_j V_k} - \sum_{i=1}^{K} r_{V_j D_i}\, r_{V_k D_i} \tag{6.3}$$

For $V_1$ and $V_2$ in the Holzinger data of Table 6.3,

$$r_{V_1 V_2 \cdot D_1 D_2 D_3 D_4} = .33 - (.39)(.26) - (.19)(.10) - (.40)(.47) - (.07)(\text{[illegible]}) = \text{[illegible]}$$

Clearly, then, $D_1$, $D_2$, $D_3$, and $D_4$ are sufficient to account for the
correlation of $V_1$ and $V_2$.
Just how sufficient the four dimensions are in reproducing the raw
correlations of all 24 variables can be seen in the last two columns of
Table 6.3. For each variable we counted the number of final residuals with
a first digit of .0 and the number with first digit .1 or greater. It is apparent
from these two columns that quite generally all fourth-factor residuals
are trivial.
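
The same computation can be carried out for every pair of variables at once, followed by the tally of small and large residuals. This is a sketch only: the function names are hypothetical, the toy numbers in the usage lines are not Holzinger data, and "first digit .0" is taken to mean an absolute residual below .10.

```python
def final_residuals(r, coefficients):
    """Residual matrix after all K independent dimensions (Eq. 6.3).

    r: square list of lists of original correlations.
    coefficients: one row per variable, one orthogonal factor
    coefficient per dimension. Diagonal cells are left at zero.
    """
    n = len(r)
    resid = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            if j != k:
                reproduced = sum(a * b for a, b in
                                 zip(coefficients[j], coefficients[k]))
                resid[j][k] = r[j][k] - reproduced
    return resid


def tally_residuals(resid, j):
    """Count variable j's residuals below .10 and at .10 or above."""
    small = sum(1 for k, v in enumerate(resid[j])
                if k != j and abs(v) < 0.10)
    return small, len(resid[j]) - 1 - small


# Toy illustration with three variables and two dimensions.
toy_r = [[1.0, 0.60, 0.30], [0.60, 1.0, 0.20], [0.30, 0.20, 1.0]]
toy_f = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.6]]
toy_resid = final_residuals(toy_r, toy_f)
```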


Number of dimensions


A troublesome problem in the long history of factor analysis has been that 
of deciding how many dimensions to define. There is, of course, no com- 
pletely general answer to this question. It depends on how salient one 
wants the dimensions to be. In key-cluster factoring by the BC TRY System 
the matter is left up to the investigator, though for programming purposes 
we put a stop on 15 dimensions. Experience has rather generally shown, 
however, that to factor into additional dimensions that account for less 
than, say, 5 percent of the estimated overall communalities is likely to 
result in dimensions whose defining variables either are difficult to con- 
ceptualize or are very narrow doublets that hardly seem worth dignifying 
as definers of a general dimension. 
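
The rule of thumb can be stated as a short sketch. The function name and its defaults are hypothetical; the 5 percent bound and the stop at 15 dimensions are the figures mentioned above, and the proportions in the usage lines are those reported for the two neighborhood studies in Tables 6.4 and 6.5.

```python
def salient_dimensions(proportions, min_share=0.05, max_dimensions=15):
    """Count leading dimensions that each account for at least
    min_share of the summed estimated communalities."""
    kept = 0
    for share in proportions[:max_dimensions]:
        if share < min_share:
            break               # this and later dimensions are dropped
        kept += 1
    return kept


voting = salient_dimensions([0.44, 0.40, 0.07, 0.04])
social_area = salient_dimensions([0.47, 0.32, 0.15, 0.06])
```

Under this bound the fourth voting-attitude dimension is dropped, whereas the fourth social-area dimension just clears it, which is why the text treats that case as borderline.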


Number of iterations of factoring 


Before factoring begins, estimates of the communalities are inserted in 
the diagonal cells of the correlation matrix. At the end of the first complete 
factoring procedure we have a revised set of communalities from the 
factoring itself. These new values are now inserted in the diagonal cells of 
the original correlation matrix, and a second factoring is made. This itera- 
tion process is repeated until a new set of communalities from factoring 
is the same as the last set, i.e., until the communalities converge. 

The number of iterations undertaken is determined by the numerical 
precision to which one wishes to achieve convergence. Experience has 
generally shown that in most problems convergence of all communalities 
is better than a .05 criterion for all variables by about the fourth iteration. 
In the Holzinger problem the largest difference in any communality between 
its third and fourth iteration (printed on the diagonal in the final residual 
matrix) is .02; 14 of the 24 are less than .001. 
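
The iteration itself is a simple loop around whatever factoring routine is in use. In the sketch below, `refactor` is a hypothetical callable standing in for one complete key-cluster factoring with the given communalities in the diagonal; the tolerance and the cap on iterations are likewise assumptions of the illustration.

```python
def iterate_communalities(initial, refactor, tolerance=0.001,
                          max_iterations=25):
    """Refactor until no communality changes by tolerance or more.

    initial: starting communality estimates, one per variable.
    refactor: hypothetical callable that takes the current diagonal
    values and returns the communalities produced by factoring.
    Returns (communalities, number of iterations performed).
    """
    current = list(initial)
    for iteration in range(1, max_iterations + 1):
        revised = list(refactor(current))
        largest_change = max(abs(a - b) for a, b in zip(revised, current))
        current = revised
        if largest_change < tolerance:
            return current, iteration
    return current, max_iterations
```

A looser tolerance such as .05 corresponds to the criterion that the text reports as usually met by about the fourth iteration.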

Convergence through iteration is desirable not only to assure that 
stable factor coefficients are obtained: studies of artificial matrices where 
true values are known, because the matrix was produced by them, have 
indicated that only by iteration are the true values recovered (Tryon, 1957b,
table 1, item 2 on the CC method). 


The social-area study 


For comparison with the Holzinger study, we present the general results 
of key-cluster factoring of the correlations among the 33 demographic 
characteristics of neighborhoods (census tracts) in the social-area study.
Figure 4.1 shows sample variables from collinear clusters of three domains 
of demographic characteristics. How many dimensions does key-cluster 
factoring indicate would be sufficient to account for the generality among 
the 33 demographic attributes of neighborhoods? 

The main data on this question are given in Table 6.4, which is parallel 
in form to Table 6.1 for the Holzinger analysis. Compared with the cognitive 
tests of children, the demographic characteristics of neighborhoods have 
more generality, as shown by the RMS original intercorrelation of about .44. 
After the third dimension the mean residual is only .06. Extracting a fourth 
does not substantially alter it. Sufficiency of the key-cluster dimensions to 
reproduce the initial estimates of communalities is shown in Sec. A of the 
table. After the third dimension is defined, 94 percent of the estimated
communalities is exhausted.

The evidence indicates that there are clearly three salient demo- 
graphic dimensions of neighborhoods. Should we accept the fourth one that 
contributes only 6 percent to the communalities of the variables and that 
reduces the residuals so insignificantly? Here is an example of a common 
dilemma in factoring: whether to accept a last factor that seems of border- 
line value. We leave the matter unresolved until the next chapter, where 
we examine this fourth dimension in the context of the whole cluster
structure of the demographic features.


TABLE 6.4 SUFFICIENCY OF KEY-CLUSTER DIMENSIONS (FACTORS)
IN THE SOCIAL-AREA STUDY (DEMOGRAPHY)

A. Reproduction of Communalities during First Factoring

                                     Verbal   Speed   Spatial   Memory
                                     C₁       C₂      C₃        C₄
Proportion of sum of estimated
  communalities, each dimension      .47      .32     .15       .06
Cumulative proportion of
  communalities from
  dimension C₁ to C₄                 .47      .79     .94       1.00

B. Reproduction of Original Correlations after Fourth Iterated Solution

Root Mean Square        Original        Residuals after


The voting-attitude study 


The comparable data on the sufficiency of key-cluster factors in the voting- 
attitude study of small neighborhoods (precincts) are given in Table 6.5. 
The RMS of the original correlations among the 31 election issues is .38. 
The residuals after the second dimension shrink to only .07, those after 
the third dimension to .05, signifying that in general the third dimension 
has removed little additional correlation from the matrix. The first and 
second dimensions account for only 84 percent of the communalities of all 
the variables, whereas the third raises the figure to 91 percent. The problem 
here is whether to retain the fourth dimension. It contributes only 4 percent 
to the communalities, and it does not reduce the residuals in any significant 
way. It looks here, therefore, as if there were only three salient attitudinal
dimensions of small neighborhoods.


TABLE 6.5 SUFFICIENCY OF KEY-CLUSTER DIMENSIONS (FACTORS)
IN THE VOTING-ATTITUDE STUDY

A. Reproduction of Communalities during First Factoring

                                     Verbal   Speed   Spatial   Memory
                                     C₁       C₂      C₃        C₄
Proportion of sum of estimated
  communalities, each dimension      .44      .40     .07       .04
Cumulative proportion of
  communalities from
  dimension C₁ to C₄                 .44      .84     .91       .95

B. Reproduction of Original Correlations after Fourth Iterated Solution

                                     Residuals after
                      Original       C₁       C₂      C₃        C₄
Root Mean Square      .38            .27      .07     .05       .04


Chapter 7 


CLUSTER STRUCTURE ANALYSIS 


Though the processes of key-cluster factoring indicate the salient number
of cluster composites to be formed from the variables of a study
and also what the definers of the clusters may be, they do not describe
the structure of the relationships among the variables, a matter now to
be considered.

The three main objectives of structure analysis are:

1 Inner structure of clusters: to provide information permitting improvement
in the sets of definers of each cluster that have been selected by the
factoring procedures.

2 Structural generality: to discover how important each cluster or factor is
and how general is its kind of variation across all the variables of the study.

3 Structural relationships (organization) of clusters: to discover how the
different clusters or factors are related to one another and to the variables
of the study; knowledge of this total configuration aids in selecting the most
nearly independent and meaningful clusters for O-analysis.


There are two ways to describe structure. The first is by various 
statistical quantities. The second is graphical and geometric. Both
quantitative and graphical analyses are useful in understanding the cluster
and factor structure.

The first step is to study the results of the initial key-cluster factoring, 
as described in Chap. 6. Then, if the definers of the factored dimensions 
require revision, one revises them and reruns key-cluster factoring but 
this time "presetting" the dimensional analysis on the revised set of
definers. Both the original and the preset analysis include the Cluster 
Structure Analysis (called the CSA program in BC TRY), which gives the 
quantitative description of the inner structure of the clusters and their 
structural interrelationships and generalities. Also involved in both analy- 
ses is the BC TRY program component SPAN, which provides a graphical 
display of the configuration, namely, SPherical ANalysis. 


Preset key-cluster factoring 


Consider the results of the preset rerun on the Holzinger variables, in 
which the clusters are selected by the analyst from study of the empirical 
results. The first empirical key-cluster factoring described in Chap. 6 does 
not give the "best" definers of the four independent dimensions. The 
final metric description of structure is that given in the revised, preset run 
and not in the original blindly empirical factoring that usually does not give
an optimal solution. We describe in a later section the specifics for
application of general criteria used to improve the selection of definers for
the preset solution. Here the reader should simply note the composition of
the revised sets, listed in Table 7.1 in the column headed Definers
(Revised). There it will be seen that F3, form board, has been added as a
definer of the third dimension, C₃, and the two mathematical tests, N21,
puzzles, and N24, arithmetic, have been deleted as definers, respectively, of
dimensions C₃ and C₂. The results of factoring on the revised definers in this
case differ little from those of the initial factoring. The revised orthogonal
factor coefficients and communalities of Table 7.1 should be compared
with the corresponding values from initial factoring, given in Table 6.3. The
residuals given in the columns headed .0 and .1 are also about the same, a
few less in the .1 category for the spatial and speed tests and a few more
for the memory and mathematical tests.

Before leaving this matter we point out a great advantage of preset 
key-cluster factoring over the orthodox forms of factoring by the principal- 
axes and centroid solutions. Being able to preset the definers of key 
clusters makes it possible to define dimensions by any subset of variables. 
The analyst has complete command over the factoring process in key- 
cluster factoring. 


Cluster structure analysis (CSA)


Inner cluster structure 


The results of the revised analysis in the Holzinger study are given in 
Table 7.2 (copied from the CSA computer printout sheets). The first
cluster, C₁, the verbal cluster, is defined by the same four variables
(marked D in the column headed Definers) as in the first factoring. But
the second cluster, C₂, speed, is defined now by only three variables,
because in the revision we have deleted N24, arithmetic, as a definer. The
third cluster, C₃, space, is altered by replacing N21, puzzles, by F3, paper
form board. But the fourth, C₄, memory, is defined as in the first factoring.

Oblique factor coefficients (from CSA) are of central interest in 
structure analysis. They are to be distinguished from the orthogonal factor 
coefficients computed in the CC factoring procedure. The distinction will 
be clear from Table 7.1, where the columns of revised orthogonal factor 
coefficients of all the variables and oblique factor coefficients are clearly 
labeled. For example, the verbal test V5, information, has as its four ortho- 
gonal coefficients the numbers .80, .09, .07, and -.09, whereas its corre- 
sponding four oblique factor coefficients are .80, .40, .54, and .36. 

The cluster-defined dimensions in structure analysis are called
oblique clusters because this term "oblique" has come to have the same
meaning as the word "correlated" in factor-analytic jargon. They are oblique
clusters because, being defined by composites of standard scores on their
defining variables, they would normally be expected to yield nonzero
correlations with other clusters. In contrast, the orthogonal dimensions
derived by factoring, being defined by residuals from which variance from
other cluster-defined dimensions has been removed, would necessarily correlate
.00 with each other, hence be orthogonal. The terms oblique and orthogonal
are really geometric terms, the meanings of which are quite clear
when we come to the graphical geometric representation of cluster
structure.


Oblique unifactor structure 


As an aid in interpreting each oblique cluster we have programmed the 
cluster-structure component (CSA) to assign each variable of the study to 
that particular cluster with which its oblique factor correlation is highest. 
For example, in Table 7.2, under the four definers of C₁, verbal, marked D
are listed three other tests, V8, N20, and N22, which have been put there 
because their highest oblique factor coefficients are with this cluster, a 
point that can be verified from the oblique factor coefficients in Table 7.1. 

TABLE 7.2 INNER CLUSTER STRUCTURE OF THE FOUR BASIC AND EXPANDED
OBLIQUE CLUSTERS IN THE HOLZINGER STUDY (OBLIQUE UNIFACTOR STRUCTURE)ᵃ

                                                        Expanded Cluster
                          Oblique                         Reliability
Oblique                   Factor                r̄ on
Cluster         Definers  Coefficient   h²      D's     Single        Cumulative

C₁, verbal:
  V9   Wmn         D      .87           .76     .72
  V7   Snt         D      .83           .69     .69
  V6   Cmp         D      .83           .69     .69
  V5   Inf         D      .80           .66     .66
  V8   Wcl                .68           .50     .57     .90           .90(1)ᵇ
  N20  Ded                .52           .39     .43     .89           .90(2)
  N22  Rsn                .52           .35     .43     .89           .90(3)
  Domain validity  .95
  Reliability      .90

C₂, speed:
  S10  Add         D      .81           .70     .58
  S12  Cnt         D      [illegible]   .60     .54
  S13  Scc                .68           .64     .49     [illegible]   [illegible](1)
  N24  Ari                .63           .48     .45     .82           .85(2)
  S11  Cod         D      .61           [illegible]  .44
  Domain validity  .89
  Reliability      .79

C₃, space:
  F4   Loz         D      [illegible]   .57     .45
  F1   Vis         D      .66           .44     .40
  N23  Ser                .64           .47     .38     .75           .75(1)
  N21  Puz                .59           .45     .35     [illegible]   .79(2)
  F3   Fbd         D      .49           .26     .30
  F2   Cub         D      .49           .25     .30
  Domain validity  .84
  Reliability      .70

C₄, memory:
  M14  Wrg         D      .64           .43     .39
  M17  Wn          D      .62           .45     .38
  M16  Frg         D      .62           .47     .38
  M15  Nrg         D      .56           .32     .34
  M18  Nf                 .54           .42     .33     .76           .76(1)
  Domain validity  .85
  Reliability      .72

ᵃ Variable excluded because communalities are below .20: M19 figure-word recall.
ᵇ For explanation of figures in parentheses see text.

This kind of assignment is called an "oblique unifactor structure" because
each variable is assigned to only one of the factors (Harman, 1967, p. 104).
Such an allocation can be an aid in interpreting the oblique dimension, 
because a study of the constructed content of these nondefiners often 
helps in conceptualizing the rational meaning of the cluster. The variables 
in the unifactor listing of Table 7.2 have been ordered by the sizes of their 
oblique factor coefficients. 
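
The allocation rule is easy to sketch. The function and its argument names are hypothetical and this is not the CSA code; "highest" is taken literally as the largest signed coefficient. The sketch also applies the exclusion of variables with communalities below .20 that CSA is reported to make (see the first footnote to Table 7.2). The V5 coefficients are those quoted earlier in this chapter; the second variable is invented for illustration.

```python
def unifactor_assignment(oblique, communality, min_communality=0.20):
    """Assign each variable to the cluster on which its oblique factor
    coefficient is highest, skipping variables of low communality.

    oblique: dict of variable name to its list of oblique coefficients.
    communality: dict of variable name to its communality h^2.
    Returns a dict of variable name to cluster index (0 for C1, etc.).
    """
    assignment = {}
    for variable, coefs in oblique.items():
        if communality[variable] < min_communality:
            continue            # excluded from the unifactor listing
        assignment[variable] = max(range(len(coefs)),
                                   key=lambda i: coefs[i])
    return assignment


listing = unifactor_assignment(
    {"V5": [0.80, 0.40, 0.54, 0.36],
     "X99": [0.10, 0.30, 0.20, 0.25]},   # X99 is a made-up variable
    {"V5": 0.66, "X99": 0.15},
)
```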


Expanding an oblique cluster for 
purposes of scoring individuals 


There is another important use of the unifactor listing. In revising the 
definers of a cluster one may have included, at the time, what appeared 
to be the best definers of the given domain, but it may be discovered that 
a cluster or factor score on these definers yields a reliability coefficient 
that is too low for scoring individuals. Generally it is desirable to have the 
reliabilities of any composite score .90 or more if individuals are to be com- 
pared with each other on that score, as is done in object cluster analysis. 
Information on the nondefiners in a unifactorial listing may help in expand- 
ing the number of variables in a cluster in order to increase its reliability 
coefficient. 

Below each cluster listing in Table 7.2 is information about the cluster
score formed from a composite of the basic definers marked D. For
example, the domain validity of a cluster score composed of the four
definers of C₁ is .95, whereas its α reliability coefficient is the square of
this value, .90. Expanding a cluster to include additional nondefining
variables changes the reliability of the cluster score. The relevant information
on the reliability of the clusters in the Holzinger study is given in Table 7.2
under Expanded Reliability. Under Single can be found the value of the
reliability of the composite formed by adding each of the nondefining
variables singly to the basic composite of four definers. For example, if V8 is
added to form a new five-variable composite for C₁, the cluster reliability is
unchanged. If either N20 or N22 is added, the reliabilities of the two
five-variable composites are .89. There is no reason here for adding any of the
nondefiners to the basic cluster of four.
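
The relation between reliability and domain validity stated above can be checked numerically. The sketch goes one step further and assumes that the α reliability of a composite of standard scores takes the familiar form k r̄ / (1 + (k − 1) r̄), where r̄ is the mean intercorrelation of the k definers; that formula is an assumption of this illustration, not a statement of how BC TRY computes the value. The figure .69 is a rough mean read off the r̄ column for the four verbal definers.

```python
import math


def alpha_reliability(k, mean_r):
    """Assumed standardized form of the alpha coefficient for a
    composite of k variables with mean intercorrelation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def domain_validity(reliability):
    """Domain validity as the square root of the reliability,
    the relation stated in the text."""
    return math.sqrt(reliability)


verbal_alpha = alpha_reliability(4, 0.69)   # close to the tabled .90
verbal_validity = domain_validity(0.90)     # close to the tabled .95
```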

The reliability coefficient of the cluster composite when the additional 
nondefiners are added to it cumulatively and in the order of their single 
effects, as indicated by the ordinal numbers in parentheses, is given in 
the column headed Cumulative. Thus, when three nondefiners of the 
verbal cluster, V8, N20, and N22, are ordered by the sizes of the reliabilities 
resulting from adding them singly and then added in that order cumulatively
to the basic four definers, the reliabilities of the expanded clusters
remain unchanged at .90. The reliability coefficient of the composite score
on the three definers of the speed cluster is only .79, but the table shows
that it can be raised to .85 by adding the two nondefiners allocated to it,
S13 and N24. With the space cluster, its basic reliability can be increased
from .70 to near .80.

Table 7.2 includes the values of the communality h² of each variable,
and also each variable's mean original correlation r̄ with the basic definers
of the cluster. Generally, a variable with low values on these two statistics
does not contribute to the reliability or meaning of a cluster. This finding
is so universal that the CSA component of BC TRY is programmed to
exclude variables from unifactor allocation when their communalities are
lower than .20, as indicated in the first footnote to Table 7.2 for the M19,
figure-word recall, test.



Structural generality of an oblique cluster or factor 


One way to assess the generality of a given cluster's variation is to note
how much the intercorrelations among all the variables would be reduced
if the cluster were a constant instead of a variable. This effect is precisely
what is revealed in key-cluster factoring by the sizes of the first factor
residuals after extracting the first dimension. To discover the generality
of the different clusters we refactor the correlation matrix as many times
as we have oblique clusters but on each refactoring merely define a first
dimension, defining it by one of the sets of oblique clusters. Instead of
computing first factor residuals in each case, however, we compute the
reproduced correlations from the first factor coefficients, as described
and illustrated in Chap. 6, in the section on residuals and reproduced
correlations.

A simple index, then, of the generality of a given oblique cluster is to
compare the reproduced correlations with the actual values of the original
correlations. The index chosen is simply the ratio of the mean squares of
the reproduced correlations of all variables divided by the mean squares
of all their original correlations. If the reproduced correlations perfectly
match the original correlations, the reproducibility ratio is 1.00. If no
reproducibility occurs, the reproduced correlations are all .00 and the
ratio is .00.

The results in the Holzinger study are given in Table 7.3, Sec. A. In
the first row the verbal and space clusters have reproducibilities of the
order .50, whereas those of the speed and memory clusters are nearer .30.
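
The reproducibility ratio can be sketched as follows. The function name is hypothetical, and the sketch takes the means over the off-diagonal pairs only, which is an assumption; since both means run over the same pairs, the ratio of mean squares equals the ratio of sums of squares.

```python
def reproducibility_ratio(original, first_coefficients):
    """Mean square of the correlations reproduced by one cluster,
    treated as a first dimension, over the mean square of the
    original correlations (off-diagonal pairs only)."""
    n = len(original)
    sum_sq_reproduced = 0.0
    sum_sq_original = 0.0
    for j in range(n):
        for k in range(j + 1, n):
            reproduced = first_coefficients[j] * first_coefficients[k]
            sum_sq_reproduced += reproduced ** 2
            sum_sq_original += original[j][k] ** 2
    return sum_sq_reproduced / sum_sq_original
```

The function returns 1.00 when the products of the first factor coefficients match every original correlation and .00 when all the coefficients are zero.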

TABLE 7.3 STRUCTURAL GENERALITY OF THE FOUR OBLIQUE CLUSTERS
(FACTORS) AND THE RELATIONSHIPS AMONG THEM

A. Generality of the Oblique Cluster Dimensions

                              C₁, Verbal   C₂, Speed   C₃, Space   C₄, Memory
Reproducibility of mean
  squares of correlations     .51          .32         .47         .34
Reproducibility of
  communalities               .50          .40         .48         .41

B. Correlations between Cluster Scores (Factor Estimates)ᵃ

C₁, verbal                    (.90)        .33         .46         .38
C₂, speed                     .33          (.79)       .31         .38
C₃, space                     .46          .31         (.70)       .38
C₄, memory                    .38          .38         .38         (.72)

C. Estimated Correlations between Cluster Domains
(Common Factor Correlations)

C₁, verbal                    (1.00)       .39         .58         .47
C₂, speed                     .39          (1.00)      .42         .51
C₃, space                     .58          .42         (1.00)      .53
C₄, memory                    .47          .51         .53         (1.00)

ᵃ α reliability coefficients in parentheses.


Another index of generality is the degree to which the communalities 
of all the variables are reproduced by each cluster treated as the first 
independent dimension in key-cluster factoring. Recall that the amount 
of variance of a single variable predictable from a given dimension is the 
partial communality of the variable, namely, the square of the variable's
factor coefficient on the dimension. If the partial communality on a dimen- 
sion equals the total communality, then that dimension fully accounts for 
the common variance of the particular variable. Therefore, by treating 
each cluster as a first dimension in factoring, another index of its gen- 
erality can be formed by summing the partial communalities of all variables 
on that dimension and dividing the sum by the sum of the total communali- 
ties of all variables. When the index value approaches 1.00, the given 
cluster's variation is completely general; near zero, its generality is very 
limited. 
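The two generality indexes can be sketched in a few lines. This is a minimal illustration, not the BC TRY computation: the function names and the toy numbers are hypothetical, and the correlation index assumes that the correlation reproduced from a single dimension is the product of the two variables' factor coefficients on that dimension.

```python
def communality_reproducibility(dim_coeffs, communalities):
    """Sum of partial communalities on one dimension / sum of total communalities."""
    return sum(f * f for f in dim_coeffs) / sum(communalities)


def correlation_reproducibility(dim_coeffs, R):
    """Mean square of reproduced correlations / mean square of original ones.

    Assumption: the correlation reproduced from one dimension is f_i * f_j.
    Only the off-diagonal pairs (i < j) are used.
    """
    n = len(dim_coeffs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    reproduced = sum((dim_coeffs[i] * dim_coeffs[j]) ** 2 for i, j in pairs)
    original = sum(R[i][j] ** 2 for i, j in pairs)
    return reproduced / original  # the 1/len(pairs) factors cancel


# Hypothetical toy data: three variables, coefficients on one dimension.
f = [0.8, 0.7, 0.6]
h2 = [0.80, 0.70, 0.45]          # total communalities (each >= f**2)
R = [[1.00, 0.56, 0.48],
     [0.56, 1.00, 0.42],
     [0.48, 0.42, 1.00]]         # here R is exactly reproduced by f

comm_index = communality_reproducibility(f, h2)
corr_index = correlation_reproducibility(f, R)
```

In this toy case the correlation index is 1.00 because the single dimension reproduces every correlation, while the communality index falls below 1.00 because part of each variable's common variance lies off the dimension.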

The communality reproducibility indexes of the four Holzinger clus- 
ters are given in Table 7.3. The findings on the generality of the clusters 
by this index are about the same as by the reproducibility index of 
correlations.


Structural interrelationships among the clusters


The definers of each of the oblique clusters are usually only a limited 
sampling of variables drawn from a collinear domain of variation. The 
cluster scores on each of these domains of variation are referred to in 
orthodox factor terminology as “factor estimates” or “measurements of
factors” (see Harman, 1967, chap. 16). An understanding of the different
kinds of cluster variation in a study is usually aided by computing the raw 
correlations between the cluster scores, as illustrated by Table 7.3, Sec. B. 
Solution for these correlations does not require actually scoring the children 
on the different clusters because the correlations can be computed directly 
from the matrix of correlations between all the variables (see Chap. 12). 
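Why no scoring is needed can be seen from the standard formula for the correlation between two unit-weighted sums of standardized variables: the sum of the cross-correlations divided by the square root of the product of the two within-cluster sums. The sketch below assumes simple unit weighting of the definers; the function names and the four-variable matrix are hypothetical, and the exact procedure of the BC TRY programs is given in Chap. 12.

```python
import math


def block_sum(R, rows, cols):
    """Sum of the entries of R in the given rows and columns."""
    return sum(R[i][j] for i in rows for j in cols)


def cluster_score_correlation(R, cluster_a, cluster_b):
    """Correlation between unit-weighted sums of standardized variables.

    cluster_a and cluster_b are lists of variable indexes into R.
    The diagonal 1.00 entries of R are included in the within-cluster sums,
    since they are the variances of the standardized variables.
    """
    cross = block_sum(R, cluster_a, cluster_b)
    var_a = block_sum(R, cluster_a, cluster_a)
    var_b = block_sum(R, cluster_b, cluster_b)
    return cross / math.sqrt(var_a * var_b)


# Hypothetical correlation matrix: variables 0, 1 form one cluster, 2, 3 another.
R = [[1.0, 0.5, 0.3, 0.3],
     [0.5, 1.0, 0.3, 0.3],
     [0.3, 0.3, 1.0, 0.5],
     [0.3, 0.3, 0.5, 1.0]]

r_ab = cluster_score_correlation(R, [0, 1], [2, 3])
```

With these numbers the cross sum is 1.2 and each within-cluster sum is 3.0, so the two cluster scores correlate .40 even though every single cross-correlation is only .30.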

Inspection of the values in Table 7.3 shows that they are all greater 
than zero, meaning that despite the fact that the different clusters appear 
to sample quite different types of cognitive abilities, they overlap some 
common substrata of causes of individual differences. It is ill-advised to 
put much stock in these correlations, because each cluster score is subject 
to error since it is based on a limited sample of variables. The net effect 
of this error of domain sampling, usually called the “error of measurement,”
is shown by the size of the α reliability coefficients of the cluster
scores, given in parentheses down the diagonal of the raw correlation
matrix in Table 7.3, Sec. B. The further these reliability coefficients fall
below the maximum of 1.00, the less one should depend upon the score
intercorrelations as representing domain intercorrelations.

More important are the estimated correlations between domain
scores, i.e., hypothetical scores of the children “freed of limitations of
sampling each domain.” These estimates, called the “correlations between
factors,” “common factor correlations,” or, better, “interdomain
correlations,” are computed by the CSA program of BC TRY. They also have a
geometric meaning, as we see in the next section when we plot the
variables in a graphical model showing the configuration of all the relationships
among the variables.

The values of the interdomain correlations are given in Table 7.3, 
Sec. C. They estimate what the correlations would be among the clusters 
if we had a great many tests in each of the clusters—additional samples 
of variables that were collinear with the observed set. They indicate what 
the relationships would be among the four different kinds of variations 
among the children if we were able to get a thorough measurement of the 
four kinds of variation in the verbal, speed, space, and memory clusters. 
The verbal cluster domain correlates highly with the space domain, and the 
space dimension seems to overlap on the other dimensions a bit more
heavily than the verbal does. The speed domain seems to be a more specific
kind of variation, but still it shares some common variance with the other 
three cognitive domains. 
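The relation between the score correlations of Sec. B and the interdomain correlations of Sec. C can be checked with the classical correction for attenuation, in which a score correlation is divided by the square root of the product of the two reliabilities. That CSA uses exactly this formula is an assumption here; the sketch only shows that the formula, applied to the rounded values printed in Table 7.3, comes within about .01 of every printed interdomain value.

```python
import math


def interdomain_correlation(r_scores, rel_a, rel_b):
    """Classical correction for attenuation (assumed, not taken from CSA)."""
    return r_scores / math.sqrt(rel_a * rel_b)


# Values as printed in Table 7.3, Sec. B.
reliability = {"verbal": 0.90, "speed": 0.79, "space": 0.70, "memory": 0.72}
score_r = {
    ("verbal", "speed"): 0.33,
    ("verbal", "space"): 0.46,
    ("verbal", "memory"): 0.38,
    ("speed", "space"): 0.31,
    ("speed", "memory"): 0.38,
    ("space", "memory"): 0.38,
}

# Values as printed in Table 7.3, Sec. C, for comparison.
printed = {
    ("verbal", "speed"): 0.39,
    ("verbal", "space"): 0.58,
    ("verbal", "memory"): 0.47,
    ("speed", "space"): 0.42,
    ("speed", "memory"): 0.51,
    ("space", "memory"): 0.53,
}

estimates = {
    pair: interdomain_correlation(r, reliability[pair[0]], reliability[pair[1]])
    for pair, r in score_r.items()
}
largest_gap = max(abs(estimates[p] - printed[p]) for p in printed)
```

The small remaining gaps are what one would expect from working with two-decimal inputs rather than the unrounded values.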

We obviously are dealing here with an oblique set of domains. This 
result is a common finding in the broad domain of cognitive abilities, one 
that has led some psychologists to assert the existence of a single common
“underlying factor,” often termed g (for general factor). It has provided
such psychologists with the justification for lumping many kinds of tests
like those of the Holzinger study into a single test battery, the total score
on which is termed “general intelligence,” scored in the form of the IQ.
This simple idea is only one of many ways of interpreting such a matrix. 
A sounder one theoretically is to assume the operation of complex multiple 
environmental and genetic overlaps between the different cluster domains 
(Tryon, 1935). 


Graphical representation of structure: SPAN 


There is no substitute for a visual map as a means of describing the gen- 
eral structure of the relations among the variables. A pictorial display of 
the configuration shows at a glance the collinearity and structure of each 
cluster, its degree of generality with respect to all the variables, and the 
relationships among the clusters. The program of the BC TRY System that 
prints the configuration is the SPherical ANalysis component (SPAN). If 
the dimensionality of a matrix is three, SPAN presents the configuration 
of variables as a surface layout on a single sphere (a space of three
dimensions), so that by capitalizing on clues for depth perception, one can “see”
the configuration. If there are more than three dimensions, SPAN breaks
up the configuration (which now lies in a hypersphere of more than three
dimensions) so that only those parts of the configuration that can be
visually represented on spheres are printed. Each such part is called a
“dimension set” or “subspace.” By cross-referencing the partial
configurations on the different dimension sets one can usually come to “see” the
whole configuration as it lies in the hypersphere of more than three 
dimensions. 

Here we illustrate the graphical representation of the configuration 
of the variables of the Holzinger analysis. The orthogonal factor coefficients 
of the 24 tests on the four dimensions are given in Table 7.1. Since these 
dimensions are independent, they can be represented as four cartesian 
axes set at right angles to one another. Each of the 24 tests can therefore
be plotted in this space by its four coefficients treated as coordinates. But
this is not exactly the way SPAN proceeds. First, the factor coefficients of
each test are augmented by dividing the four of them by the square root
of the sum of their squares, i.e., by h, the square root of the communality.
The resulting values are called the augmented factor coefficients of the
test.

Table 7.1 gives the augmented factor coefficients of the 24 tests. For 
example, in the first test, F1, visual figure completions, the squares of the
augmented values .59, .29, .76, and .03 sum to 1.00 (except for rounding errors).
This relationship holds for all 24 tests, a fact that has an interesting con- 
sequence: when each test is plotted on four independent axes using its 
augmented values as coordinates, all 24 tests lie on the surface of a hyper- 
sphere in four dimensions. If the factoring procedure had given a three- 
dimensional solution, all 24 tests plotted by their three augmented values 
would have been spread out on the surface of a sphere. With a four- 
dimensional solution, those particular tests that have trivial squared
augmented values on the fourth dimension, like the five verbal tests, would spread out on a sphere
defined by the first three dimensions only, since their first three squared 
values add up to 1.00. In fact, some of these tests can be plotted on a 
"sphere in two dimensions," i.e., on the circumference of a circle defined 
by the first two dimensions; these are the tests whose last two squared 
augmented values are trivial, namely, the four verbal tests, and S11 and 
N24, whose first two squared coordinates approach 1.00. In short, one can 
usually organize the four-dimensional configuration into subspaces in each 
of which the full coordinates in four dimensions can be accounted for in 
three, two, or perhaps one dimension. 
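The augmentation step and the test for membership in a subspace can be sketched as follows. The function names are hypothetical; the input is the set of augmented values printed for test F1, which are renormalized here only to remove their rounding error.

```python
import math


def augment(coeffs):
    """Divide factor coefficients by h, the square root of their sum of squares."""
    h = math.sqrt(sum(c * c for c in coeffs))
    return [c / h for c in coeffs]


def share_in_subspace(augmented, dims):
    """Proportion of a test's communality accounted for by the listed dimensions."""
    return sum(augmented[d] ** 2 for d in dims)


# Augmented values printed for F1 (rounded to two decimals in the text).
f1 = augment([0.59, 0.29, 0.76, 0.03])

total = share_in_subspace(f1, [0, 1, 2, 3])      # always 1.00 after augmentation
first_three = share_in_subspace(f1, [0, 1, 2])   # share on the first sphere
first_two = share_in_subspace(f1, [0, 1])        # share on the first circle
```

For F1 nearly all of the communality lies in the first three dimensions, so the test sits at the surface of the first sphere, whereas less than half lies in the plane of the first two dimensions, so it cannot be placed on the circle.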

This manner of depicting a configuration of points that as a totality
require representation in a nonvisual space of more than three dimensions,
i.e., a hyperspace, can perhaps be developed better if we start from one
dimension and work up to a hyperspace. In Fig. 7.1 the tests have been
spread out on the first dimension according to the values of their
augmented coordinates given in Table 7.1. The five verbal tests cluster at the
right end of the axis near 1.00. Reliance on the configuration of all 24 tests
represented on this single dimension would lead to some big mistakes
about their relationships. Though the tests that lie near 1.00 will stay
together as we add dimensions, others not near 1.00 but together in this
first dimension are likely to move apart. For example, the two speed tests
S11 and S13 appear to cluster with the four space tests F1, F2, F3, and F4.
Adding a second dimension radically changes this impression.


FIGURE 7.1

The configuration of 24 abilities (Holzinger study) plotted by their first
augmented factor coefficients.


FIGURE 7.2

The configuration of 24 abilities (Holzinger study) plotted by their first and
second augmented factor coefficients.

In Fig. 7.2 the configuration of the 24 ability tests is plotted by their 
first two augmented coordinates. The two tests of speed that previously 
clustered with the four tests of space have spread apart into different 
sectors of this two-dimensional space. The five verbal tests still remain
clustered in this two-space. Figure 7.2 is a circle in a plane, the two augmented
coordinate axes of which are set orthogonal to each other and the radius 
of which is 1.00, a unit circle. Variables that lie out near the end of the 
radius cannot be relocated by adding more dimensions. This is true for 
the four tests of speed that are near unit distance from the origin of the 
circle. 

Figures 7.1 and 7.2 illustrate the real difference between a cluster 
and a factor. The variables in the verbal cluster, C1, are all grouped near
the value 1.00 on the factor dimension defined by the cluster. All other
variables in the study are also located on the axis for the dimension, as
represented in the figures. In Fig. 7.2, where a second factor dimension
is defined by the speed cluster, C2, the verbal variables are still grouped
near the value 1.00 on the first axis while the other variables have scattered.
The variables of C2 are grouped together in the two-dimensional plot (they
form a geometric cluster), but they do not fall so near the 1.00 terminus
of the axis as the C1 variables do on the first axis terminus. The clusters in
cluster analysis are selected to be as nearly independent among
themselves as possible. If the clusters C1 and C2 in Fig. 7.2 were independent,
they would appear grouped around the termini of their respective factor
dimensions. However, the empirical relationship between C1 and C2 is not
that of independence. The factor dimensions are strictly independent,
and must be, because the factoring processes define each successive
factor dimension in that way. The factor dimensions are orthogonal (at
right angles) to each other while the clusters themselves are oblique (not
at right angles) to each other with respect to the origin of the circle.
Figure 7.3 shows the variables plotted by their three augmented
coordinates. This time only those 19 tests that lie near the surface are
included in the plot in this space of three dimensions. They all lie at or
near the surface of a sphere, because any point the sum of squares of
whose three factor coordinates equals 1.00 necessarily lies on the surface
of a sphere. The sum of a variable's three squared augmented coordinate
values equaling 1.00 also means in this figure that 100 percent of its
communality is accounted for in this space.


FIGURE 7.3

Configuration on the surface of a sphere: 19 abilities plotted by the three
augmented coordinates H1, H2, H3 accounting for their communalities.


In Fig. 7.3 circular contour lines help to visualize the third dimension. 
The configuration has been physically rotated (by the SPAN program) so 
that the swarm of test points is centered in the line of the eye. At the lower 
left the five verbal tests are clustered together at the end of their axis, 
labeled C1, and set in a hexagonal box. At the lower right are the four speed
tests. The orthogonal axis that they define is shown nearby, boxed as C2.
The third dimension, C3, is defined by the tests of space, so at the upper
left are found the four F tests that define the third orthogonal axis
clustered together, near the axis boxed as C3.

The ends of the three axes are connected by arcs so that a surface
right triangle is formed by them. The arc that connects C1 with C2 can be
thought of as the place where a slice (or plane) through the sphere has 
been made; Fig. 7.2 is precisely this plane. 

The important point to be made is that the configuration of these 
19 tests in this subspace is going to stay put however many more dimensions
we add. It has already been seen that the five verbal tests remain clustered 
when a second dimension is added and both they and the speed tests 
stay put when a third is added. This means that all 19 will stay where they 
are (or change only trivially) if a fourth dimension is added. Therefore, a
fourth dimension is unnecessary to describe their configuration; three 
dimensions are sufficient. 

In Fig. 7.3 the definers of each of the three dimensions, as listed in 
Table 7.2, have been circled by dotted lines. The circled points represent 
the oblique clusters. The nondefiners unifactorially allocated to each 
oblique cluster in Table 7.2 are in fact near the oblique cluster to which 
they were allocated. In brief, the spherical configuration of Fig. 7.3 is merely
a geometric model of the quantitative values given for the three clusters
in Table 7.2. 

There remains the fourth orthogonal dimension defined by four tests 
of memory. They do not lie in the space just discussed, but they do lie in
another three-dimensional space defined by C1, C2, and C4, shown in Fig.
7.4. Sixteen tests lie at or near the surface of this sphere. The verbal and
speed clusters are carried over from the previous sphere because they
lie in the C1C2 plane that is common to the two spheres. The orthogonal
dimension C4 lies off 90° away from the first two axes (and also in a fourth
dimension away from C3, which of course is not represented in this figure).


FIGURE 7.4

Configuration on the surface of a sphere: 16 abilities plotted by the three
augmented coordinates H1, H2, H4 accounting for their communalities.

The whole configuration of the 24 tests can be observed on these two
spheres. Perhaps the reader can see it in his mind's eye in one perceptual
grasp by butting together the common verbal-speed arc of the two figures
but keeping the C3 and C4 apexes spread apart by 90°. The whole
configuration lies within this frame. In any event, the whole configuration can be
partitioned into the two SPAN sets. The inner structure of each oblique
cluster is pictorially evident in them, and the degree of relation between 
the oblique clusters is plainly displayed. 

The distinction between factored orthogonal dimensions and the 
oblique dimensions becomes clarified in the configurations of Figs. 7.3 and
7.4. In Fig. 7.3, for example, the orthogonal dimension and its oblique
counterpart defined by C1 are identical, because the first factor dimension
is represented by an axis that passes directly through the first oblique
cluster, the verbal dimension. But the second orthogonal dimension is
independent of the first (since it is derived from residuals); its definers are
the tests of the oblique speed cluster, C2. This factored orthogonal
dimension is represented by an axis that passes as near to the second oblique
cluster as it possibly can, subject to the condition that it must be 90° away
from the first dimension. Looked at from the central origin of the sphere,
the orthogonal factor dimensions derived from factoring are, as it were,
pointers at oblique clusters, but they are prevented from pointing exactly
at them because of the rigid requirement that they must be at right angles
to each other. Thus, the third factor dimension points as accurately as it
possibly can at the third oblique cluster, C3, which constitutes its set of
definers, subject to the condition that it must be 90° away from both the
first and second factor axes. 
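The "pointer" idea has a simple geometric analog in Gram-Schmidt orthogonalization: each new axis is the part of a cluster's direction that is left after removing its projections on the earlier axes, which is the direction nearest the cluster among all those at right angles to the earlier axes. The sketch below is only that analog, not the key-cluster factoring procedure itself, which works on residual correlation matrices. The function names are hypothetical, and the cosine of .39 between the two cluster directions is borrowed from the verbal-speed interdomain correlation of Table 7.3 purely as an illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_pointers(cluster_directions):
    """Gram-Schmidt: one unit axis per cluster, each orthogonal to earlier axes."""
    axes = []
    for c in cluster_directions:
        residual = list(c)
        for a in axes:
            proj = dot(c, a)  # component of the cluster direction along axis a
            residual = [r - proj * ai for r, ai in zip(residual, a)]
        norm = math.sqrt(dot(residual, residual))
        axes.append([r / norm for r in residual])
    return axes


# Two oblique cluster directions in a plane, separated by an angle whose
# cosine is .39 (illustrative value).
cos_12 = 0.39
c1 = [1.0, 0.0]
c2 = [cos_12, math.sqrt(1.0 - cos_12 ** 2)]

axes = orthogonal_pointers([c1, c2])

# Angle by which the second axis misses the second cluster, in degrees.
miss_degrees = math.degrees(math.acos(dot(axes[1], c2)))
```

The first axis passes exactly through the first cluster, while the second misses its cluster by the amount that the two clusters fall short of being at right angles.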

The orthogonal factor coefficients of Table 7.1 are the correlations 
of the variables and hypothetical variables corresponding to the termini 
of the respective orthogonal factor dimensions. The oblique factor coeffi- 
cients of Table 7.1 are the correlations of the variables and composite 
variables corresponding to a sum of the cluster variables but with com- 
munality of 1.00. The specific definitions of these coefficients by mathe- 
matical equations are given in Chap. 12. 


Criteria for final selection of clusters 


In many studies the configuration is not sharply structured into clusters 
of variables. In some parts of a study a clear cluster structure may be 
evident; in other parts the variables may form a graded series with no evi- 
dent concentration that can form a cluster; or the whole configuration may 
form a graded series. Indeed, in object cluster analysis, where the points 
in the configuration are persons and not variables, the absence of any 
clear clustering may be the usual case. Nevertheless, whatever the struc- 
ture of the configuration may be, the criteria of cluster selection that are 
programmed in the BC TRY System key-clustering procedure select pivot 
variables at the edge of the configuration and add definers to them by 
the mutual collinearity criterion. 

When cluster structure is poor, the quantitative description given 
by the CSA component sometimes misleads one into assuming that a 
sharper structure is present than in fact does exist. Only by inspecting 
the whole configuration does the real structure become clear. The mean- 
ing and existence of the parts of a cluster structure are determined by 
the broad shape of the structure, and this broad shape can be observed 
properly only in the pictorial displays of SPAN and not by the quantitative 
metrics of CSA.

It is therefore of the greatest import that one assess how well the
factored dimensions serve as pointers to clustered sectors of the
configuration. If the factoring program does not pick the most salient
cluster-defined dimensions, or if the definers of any particular
dimension are not the best ones that could be selected, one should revise the
factorial description. Revising it means refactoring the correlation matrix
by presetting the number of dimensions or by revising the definers of
one or more dimensions.

There are five criteria to keep in mind when assessing the results
of the initial factoring procedure:


1 The degree of collinearity of the definers of each dimension
2 The degree of independence of the oblique cluster that defines each dimension
3 The meaning of each defining oblique cluster as a construct
4 The contribution of the definers to the reliability of a cluster score on the oblique cluster
5 The generality of each oblique cluster and of the variables that define it


Initial factoring in the Holzinger problem selected the verbal cluster 
as the first dimension, defined by the four verbal tests indicated in 
Table 7.1 by C1 in the columns headed Definers (Revised). On the SPAN
diagrams the definers are seen to be highly collinear (see also Fig. 4.1).
The question of whether to add variable V8 to the four definers is resolved
by applying the five revision criteria to it. Its collinearity is high with the
basic four, and it fits in with the verbal meaning of the other four definers,
but we find that it would be the least independent definer with respect to
other clusters and that by expanding the cluster to include it the reliability
coefficient of the cluster score would not be increased. Furthermore, its
communality is much lower than the communalities of the four basic
definers. Hence there appears to be no point in adding V8 to the verbal
cluster. The pertinent information for these evaluations is all in Tables
7.1 and 7.2.

Initial factoring selects the speed cluster for the second dimension,
defined by three speed tests plus the mathematical test N24, arithmetic.
But in both SPAN figures (Figs. 7.3 and 7.4) this mathematics test is not
highly collinear with the other definers and fails a bit in independence since
it lies in the direction of the verbal cluster. Furthermore, including arithmetic
with “pure mental speed” seems inappropriate. The arrow drawn in the
figures indicates that on these grounds this test was rejected as a definer
of the second dimension. The reliability of the three-variable speed cluster
(N24, arithmetic, left out) is .79. With S13, straight or curved capitals
discrimination (Scc), another test of speed, added to the cluster the
reliability would be raised to .84. The SPAN figures reveal, however, that S13,
Scc, has only fair collinearity with the other definers. The communality
of S13, Scc, is as high as those of the three definers, but its independence
of other clusters is variable. In meaning, it fits in with the other definers
as a test of mental speed. The criteria, therefore, appear to encourage
adding this fourth test to the definers of the speed cluster.
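The effect of lengthening a cluster on its reliability can be roughed out with the standardized α (Spearman-Brown) formula, α = k·r̄ / (1 + (k − 1)·r̄), where k is the number of definers and r̄ their mean intercorrelation. The sketch below assumes, purely for illustration, that the added test correlates with the others at the same mean level as the original three; the function names are hypothetical, and the figure of .84 in the text comes from the actual correlations of S13, which are not reproduced here.

```python
def alpha_from_mean_r(k, mean_r):
    """Standardized alpha of a k-variable composite with mean intercorrelation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


def mean_r_from_alpha(k, alpha):
    """Invert the formula: mean intercorrelation implied by alpha for k variables."""
    return alpha / (k - (k - 1) * alpha)


# Three-definer speed cluster with reliability .79 (from the text).
implied_r = mean_r_from_alpha(3, 0.79)

# Reliability if a fourth test of the same average quality were added
# (an assumption; the real S13 has its own correlations).
alpha_four = alpha_from_mean_r(4, implied_r)
```

Under this equal-quality assumption the four-definer reliability comes out a little above .83, close to the .84 reported for the cluster with S13 added.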

A final word should be said on the criterion of generality. It commonly 
occurs on the initial factoring that one or more of the last dimensions may 
be defined by only two variables (a “doublet”) or by definers that have only
very trivial residuals on the basis of which they are selected as a dimension.
Such dimensions have low generality and should usually be
rejected in toto as dimensions, the reason being that the object of V-analysis
is to select salient dimensions, not trivia. 

In revised factoring we excluded the five mathematical ability tests 
from the four core special ability clusters because we wish (under the 
broad design of the Holzinger analysis) in a later study of prediction from 
clusters to discover the degree to which the mathematical tests represent 
more general attributes predictable from the more special verbal, speed,
space, and memory abilities. From the nature of the overall cluster
structure, such prediction looks promising because the mathematical tests
occupy a central position in the total configuration. This fact means that
they are correlated with the four core clusters. It also explains why, under
the revision criterion of seeking the most independent definers, they were
excluded from the core clusters. They, or some of them, can form a
dependent cluster, a composite score of which is highly predictable from
the four core clusters. We examine this matter of the predictability of
dependent clusters and variables from the minimally sufficient core
clusters in Chap. 10.


Cluster structure of demographic characteristics—the social-areas study 


At the beginning of the social-areas study of the San Francisco Bay Area 
in the middle 1940s this question was asked: Of the large array of demo- 
graphic characteristics of people living in urban neighborhoods, can we 
empirically discover a minimal set of demographic clusters that sufficiently 
describes differences between the neighborhoods? No one then knew the 
answer—either about the Bay Area or anywhere else. The scientific quest 
for an answer consists in observing many demographic characteristics of 
the neighborhoods and then, following the empirical key-cluster
factoring of them, discovering their cluster structure by the methods just
described. On each of the 243 neighborhoods of the Bay Area the 1940 
census provides the 33 characteristics listed in Table 7.4 (for details see 
Tryon, 1955, app. A). The census categorizes these characteristics into 
three groups: population characteristics, occupations, and home features. 
No one would reasonably argue that three rational composite scores should 
be formed from these three heterogeneous categories. An empirical key- 
cluster factoring of the intercorrelations between the 33 variables, followed 
by cluster structure analysis, reveals clearly and objectively that three 
salient oblique clusters, named S, socioeconomic independence, F, family 
life, and A, assimilation, describe all the general variance among the 
neighborhoods. 

The results of the analysis are depicted in the total configuration 
of the variables in the SPAN diagram of Fig. 7.5, which shows the revised


FIGURE 7.5

SPAN diagram showing the relationships among 33 demographic
characteristics of Bay Area neighborhoods (1940 census tracts).


TABLE 7.4 THE SOCIAL-AREA VARIABLES


Original Rational Census Categories

Population characteristics:
  Af   Older females
  Am   Younger males
  Cf   Childless females
  Co   College education
  Fe   Foreign from northwest Europe
  F    Females
  Nw   Native-born whites
  Sc   High school or better

Occupations:
  Df   Female domestics (living in)
  Ef   Employed females
  Em   Employed males
  Is   In school
  Mf   Managerial-professional females
  Mm   Managerial-professional males
  Of   Own-account females
  Om   Own-account males
  Sm   Skilled males
  Uf   Nonworking females (housewives)
  Um   Nonworking males
  Wf   White-collar females
  Wm   White-collar males

Homes (Dwelling Units):
  Bf   Better fuel
  Ch   Central heating
  Cp   "Nonfamily" couples
  Fd   Family-detached homes
  Fl   Large families
  Gr   Good repair
  Oo   Owner-occupied homes
  Re   Refrigeration
  Rn   High rent
  Uc   Undercrowded
  Vl   High home value
  Wo   Wood-fuel use


Alphabetic Reference Listing

  Af   Older females
  Am   Younger males
  Bf   Better fuel
  Cf   Childless females
  Ch   Central heating
  Co   College education
  Cp   "Nonfamily" couples
  Df   Female domestics (living in)
  Ef   Employed females
  Em   Employed males
  F    Females
  Fd   Family-detached homes
  Fe   Foreign from northwest Europe
  Fl   Large families
  Gr   Good repair
  Is   In school
  Mf   Managerial-professional females
  Mm   Managerial-professional males
  Nw   Native-born whites
  Of   Own-account females
  Om   Own-account males
  Oo   Owner-occupied homes
  Re   Refrigeration
  Rn   High rent
  Sc   High school or better
  Sm   Skilled males
  Uc   Undercrowded
  Uf   Nonworking females (housewives)
  Um   Nonworking males
  Vl   High home value
  Wf   White-collar females
  Wm   White-collar males
  Wo   Wood-fuel use

definers of the three oblique cluster dimensions S, F, and A circled by 
broken lines at or near the corners of the triangle. The meaning of each 
of these three core dimensions of urban social structure is apparent from 
the nature of the definers and of other variables with high oblique factor 
coefficients on them, as shown in Table 7.5. (The alphabetic listing of the 
variables in Table 7.4 will help in reading the figure.) 

At the lower left is S, socioeconomic independence, the dimension of 


TABLE 7.5 CLUSTER AND FACTOR STRUCTURE: THE SOCIAL-AREA STUDY
(OBLIQUE UNIFACTOR STRUCTURE IN CSA)

A. Inner Structure of the Clusters


Reliability 
Cluster 
Oblique Score 
Variables Definers| Fe h? 7 ae [casey [и УГЕ 
Cumula- 
Single tive 
Ci, socioeconomic 
independence, S 
2 Mm D 1.02 1.05 .86 
3: ‘Sc .90 .97 .76 .94 .94(1) 
1 Wm .90 .92 76 .94 | .96(2) 
27 Df 81 .78 68 .93 .96(3) 
26 VI D .81 .68 .68 
29 Co D .79 .66 67 
4 Ch .79 :77 E^ 93 -97(4) 
D 78. .56 " 
С 1” 70 A 59 .93 | .97(5) 
21 Ef 60 9 50 91 | .97(6) 
17 Mf 59 45 50 91 | .96(7) 
Domain validity P. 
Reliability ; 
Co, family || 
2 Te life, F 5 p ES B 
6::бо р .92 90 2 
8 Fd р .88 .80 .78 
9 Uf D .87 81 77. 
10: Am р .83 72 174 
M Cp 80 .82 E .96 .97(2) 
13 Wo 70 56 62 95 | .97(4) 
16 cf —.69 .80 62 .96 | .97(3) 
15 Af ENS EC DEC: .96 | .96(1) 
3l .54 .50 .48 .95 96(5) 

's 98 
Domain validity 96 
Reliability 

тавите. А Б dj 85 64 " s 
E e 8 275 .59 A (3) 
12 Nw x muc E 90 | .9x2 
qo ENS “80 76 5 4 .92(2) 
20 R .80 22 58 89 94(4) 
= 79 79 57 90 а) 
19 Rn `77 65 56 
Sr NE 2 5 42 45 
24 Fe D S 122 2 
SE Р 2' 2 | 3 86 | 96) 
30 Gr 39 2 | 29 85 93(6) 
32 Um 93 
Domain validity g 
Reliability

TABLE 7.5 CLUSTER AND FACTOR STRUCTURE: THE SOCIAL-AREA STUDY
(OBLIQUE UNIFACTOR STRUCTURE IN CSA) (Continued)


B. Generality of Clusters

                                                     C1            C2            C3
Reproducibility of mean square of correlations       .41           .26           .45
Reproducibility of communalities                 [illegible]   [illegible]       .45


C. Original and Interdomain Correlations between Clusters

                     Original                            Interdomain (cos θ)
          C1            C2            C3            C1            C2            C3
C1    [illegible]   [illegible]       .35          1.00       [illegible]       .40
C2    [illegible]   [illegible]       .32       [illegible]      1.00           .35
C3        .35           .32          (.85)          .40           .35          1.00


Notes: See Table 7.4 for explanation of abbreviated census categories. 
Variables excluded because communalities below .20: 33 Of. 


wealth and social independence. Its meaning is evident in neighborhoods 
that would have high values of S: they include relatively more people
working for self (Om), possessing more costly homes (Vl), having had an
expensive college education (Co), and being managerial-professional males
(Mm). The reliability coefficient of cluster scores on the four definers is
.91, but it could be increased to .95 or higher by adding more variables, as
shown in Table 7.5. 

Virtually independent of S is F, family life, in the right middle of 
Fig. 7.5. Neighborhoods with high scores on F would orient around the 
family: in them are relatively more housewives (Uf), more children (Am), 
more owner-occupied (Oo) single-family homes (Fd), housing larger families 
(Fl). The concept of such neighborhoods as family-centered habitats is
strengthened by two other characteristics, not containing childless females
(Cf) or older females (Af), indicated by negative correlations of these 
attributes with F (in Table 7.5) and represented in SPAN as unfilled dots 
near F but on the underside of the hemisphere. 

A, assimilation, in the upper sector of the configuration, refers to 
the degree to which a neighborhood includes persons belonging to the 
white Protestant middle-class majority class in America. The meaning of 
this dimension is clarified by noting the kinds of neighborhoods that would 
score low on A: they include relatively more persons who are
foreign-born or nonwhite (Nw), foreign persons who do not come from Protestant
Europe (Fe), more unskilled men among blue-collared workers (Sm), more
blue-collared women among those at work (Wf), and an excess of males
(F), as in Skid Row.

The demographic attributes that form the “socioeconomic gradient”
range from those variables which roughly demark assimilation at one end
to those which clearly denote socioeconomic independence at the other.
In between is a group of variables that could serve as definers either of
one or the other of the extreme clusters. A collinear “dependent cluster”
is formed out of those in the middle, Ach, social achievement. On this
dimension, neighborhoods with high cluster
scores represent success in various forms of middle-class achievements:
being employed (Em, Ef), men in white-collar work (Wm), in undercrowded 
homes (Uc), of high rent when rented (Rn), and equipped with modern 
appliances (Re, Ch). 

The generality of the core three clusters, S, F, and A, is statistically 
given in Table 7.5, Sec. B. The two dimensions S and A at the ends of the 
socioeconomic gradient are more general than F, since they encompass 
more variables. The correlations among the three clusters are given in 
Sec. C of the table, showing what is also pictorially represented in SPAN, 
that family life is quite independent of socioeconomic independence: the 
zero correlation means that neighborhoods with high home life may be 
rich or poor, that one can live in solitary splendor or as an isolate in a hole 
in the wall. On the other hand, assimilation is mildly related to family life, 
there being some tendency for neighborhoods of the social majority, 
high A, to orient around family life, high F. 

The steps by which the cluster structure of the social areas is unfolded 
are indicated by the data of Table 7.6. In the initial factoring, four ortho- 
gonal dimensions are found. The factor coefficients of all the variables are 
given in the table, as well as the distribution of their residuals. The 
sufficiency of the initial four dimensions is given at the bottom of the 
table. The fourth dimension accounts for only 6 percent of the estimated 
communalities. The definers of this fourth dimension are given in the 
column Definers (Revised). We concluded that they are a trivial, meaning- 
less doublet. 

The factoring is therefore revised, following the criteria described 
earlier. For example, the variables Sc and Wm are deleted; they are marked 
by the notation “C1 out” in Table 7.6 and by rejecting arrows in SPAN. 
Variables Co and Om are added to C1. The purpose is both to increase the 
independence of this S cluster from assimilation and to put more stress 
on the meaning of social independence in the revised cluster. Similar prin- 
ciples governed the revision of the third cluster: by rejecting all the initial 
definers of this cluster and by replacing them with those circled as C3 in 
the SPAN diagram of Fig. 7.5, its meaning was shifted from social achieve- 
ment to assimilation. The fourth dimension, being trivial, was completely 
eliminated in the revision. The results of revision are shown in Table 7.6 
under Revised Factoring. Of special interest is that the residuals, after the 
revised third dimension, are only slightly greater than the initial residuals 
after the fourth blind dimension. 
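
The judgment that a dimension is trivial rests on how little of the total communality it accounts for. A minimal sketch of that bookkeeping follows. It assumes that the share of a dimension is its column sum of squared orthogonal factor coefficients divided by the total over all dimensions; the coefficient matrix is hypothetical and is not taken from Table 7.6, and the function name is an illustration only.

```python
import numpy as np


def communality_share(coefficients: np.ndarray) -> np.ndarray:
    """Fraction of the total communality contributed by each orthogonal dimension.

    Rows are variables, columns are dimensions.
    """
    column_ss = (coefficients ** 2).sum(axis=0)
    return column_ss / column_ss.sum()


# Hypothetical coefficients of six variables on four orthogonal dimensions.
F = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.1, 0.2],
    [0.2, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.7, 0.0],
    [0.0, 0.2, 0.6, 0.1],
])

shares = communality_share(F)
for dim, share in enumerate(shares, start=1):
    print(f"dimension {dim}: {share:.1%} of communality")
```

A dimension whose share is as small as that of the fourth column here is a candidate for elimination in a revised factoring, provided the residuals do not grow appreciably.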

To summarize the findings on the structure of demographic charac- 
teristics of these urban people: by empirical cluster methods we have dis- 
covered three meaningful salient cluster-defined dimensions of their 
social structure: S, socioeconomic independence; F, family life; and A, 
assimilation. The rational census categories, which group attributes 
by their population characteristics, occupations, and home features, 
provided no helpful clues about such dimensions; we find, indeed, that the 
definers of each of the empirically derived clusters cut across the rational 
categories. A new fourth cluster, social achievement, could also be used 
to describe the people, but its defining variables are largely statistically 
dependent on the three core clusters and therefore are not needed as a 
core dimension. 

When we come later to the phase of locating social areas by O-analysis 
procedures, we must make a final decision about how to score each neigh- 
borhood. For scoring purposes we take the definers of the family life 
dimension as revised, since it is sharply differentiated in the configuration. 
Most of the other variables lie on a socioeconomic gradient that has no 
sharp cutoff into clusters. The assimilation end of the gradient is not as 
tight in its definers as we would like, but there seems to be no way to 
improve its scoring. At the independence end, we probably should elimi- 
nate VI, high home value, because it is scored in dollars, a metric that can 
change critically and rapidly over time due to the shifting recession-infla- 
tion tides of the economic well-being of the society. This would leave only 
three definers, too few for reliable scoring. We therefore add two other 
definers, Df, domestics (living in), and Uc, undercrowded. The cluster 
extended by these two attributes gives a highly reliable cluster score and 
also augments the independence of the central construct at the “high” 
end of the socioeconomic gradient. 

The statistical description of the finally revised clusters (as computed 
in a rerun of CSA) is about the same as before for the three clusters S, F, 
and A. The new dependent cluster Ach, social achievement, composed of 
seven definers, forms a cluster score with a very high α reliability of .97. 
As is evident in the SPAN diagram, the Ach domain is highly correlated 
with S and A, its interdomain correlations with them being .80 and higher 
and its raw correlations not being much lower. The correlation of Ach with 
F, family life, is quite low, of the order .20. 


Cluster structure of voting attitudes 


The power of cluster structure analysis to reduce many variables to a 
small meaningful set of collinear clusters that accounts for the generality 
in them all is perhaps no better illustrated than in the voting-attitude study. 
The question asked is this: In a typical election, into how many attitude 


TABLE 7.7 THE VOTING-ATTITUDE VARIABLES 


Issues in Order of Presentation to Voters 

State issues: 
 1  DmGv  Democrat for governor 
 2  VtBn  Veterans bonds 
 3  ScBn  School bonds 
 4  AlCn  Alcohol control 
 5  NdAg  Needy aged aid 
 6  Vsl1  Vessel (carrier) tax-exempt 
 7  LgPy  Legislators' pay increase 
 8  LnTt  Land title law 
 9  Vsl2  Vessel (fishing) tax-exempt 
10  ChEx  Church tax-exempt 
11  TrOf  Terms of office 
12  DsVt  Disabled veterans tax-exempt 
13  ExFl  Ex-felon voting 
14  VrCt  Vernon City charter 
15  CoEx  College (private) tax-exempt 
16  WlEx  Welfare institutions tax-exempt 
17  GvWt  Government water state control 
18  Pk    Parking facilities by state 
19  AlnP  Alien property ownership ok'd 
20  NnLw  Nonlawyer judges ok'd 
21  Chrt  Charters (county) loosened 
City issues: 
22  ExBn  Exhibit hall bonds 
23  RcBn  Recreation center bonds 
24  HsBn  Hospital bonds 
25  AgBn  Aged home bonds 
26  BlJb  Blind city jobs ok'd 
27  SpPy  Supervisors' pay 
28  AdOf  Administrative officer (city) appointment 
29  HsEm  Hospital employee benefits 
30  ShEm  Sheriff's employee benefits 
31  CbCr  Cable cars saved 


Alphabetical Listing 

28  AdOf  Administrative officer (city) appointment 
25  AgBn  Aged home bonds 
 4  AlCn  Alcohol control 
19  AlnP  Alien property ownership ok'd 
26  BlJb  Blind city jobs ok'd 
31  CbCr  Cable cars saved 
10  ChEx  Church tax-exempt 
21  Chrt  Charters (county) loosened 
15  CoEx  College (private) tax-exempt 
 1  DmGv  Democrat for governor 
12  DsVt  Disabled veterans tax-exempt 
22  ExBn  Exhibit hall bonds 
13  ExFl  Ex-felon voting 
17  GvWt  Government water state control 
24  HsBn  Hospital bonds 
29  HsEm  Hospital employee benefits 
 7  LgPy  Legislators' pay increase 
 8  LnTt  Land title law 
 5  NdAg  Needy aged aid 
20  NnLw  Nonlawyer judges ok'd 
18  Pk    Parking facilities by state 
23  RcBn  Recreation center bonds 
 3  ScBn  School bonds 
30  ShEm  Sheriff's employee benefits 
27  SpPy  Supervisors' pay 
11  TrOf  Terms of office 
14  VrCt  Vernon City charter 
 6  Vsl1  Vessel (carrier) tax-exempt 
 9  Vsl2  Vessel (fishing) tax-exempt 
 2  VtBn  Veterans bonds 
16  WlEx  Welfare institutions tax-exempt 
FIGURE 7.6 

Spherical configuration showing the relationships among 31 election issues 
as voted on by 200 random San Francisco neighborhoods (precincts) in the 
election of 1954. 


clusters do the wide array of issues on which the small neighborhoods 
(precincts) of a city vote meaningfully group? An empirical key-cluster factor- 
ing followed by cluster structure analysis gives a clear answer. The data con- 
sist of the 1954 voting in San Francisco of 200 random precincts drawn from 
more than 1,300 precincts. The issues studied were 30 state and city propo- 
sitions plus the vote for governor. The code names and brief descriptions 
of the 31 issues are listed in Table 7.7 and for ready reference they are 
repeated alphabetically. 

As one reads through the list it is difficult to think up any reduced 
set of meaningful rational categories into which to cast all 31 of these issues. 
However, the cluster structure of the issues becomes clearly revealed 
pictorially in the SPAN configuration of Fig. 7.6 and in the statistical uni- 
factorial allocation of them given in Table 7.8. The tight collinear cluster 
marked P at the left on the SPAN diagram is rather sharply differentiated 
TABLE 7.8 CLUSTER AND FACTOR STRUCTURE: THE VOTING-ATTITUDE STUDY 
(OBLIQUE UNIFACTOR STRUCTURE IN CSA) 


A. Inner Structure of the Clusters 

Variables                       Oblique Fc     h² 

C1, political statism, P 
   5  NdAg                         .96         .93 
  30  ShEm                         .92         .85 
  29  HsEm                         .90         .80 
   1  DmGv                         .86         [illegible] 
  18  Pk                           .83         [illegible] 
   9  Vsl2                         .82         .69 
  13  ExFl                         .76         .72 
  28  AdOf                         .63         .42 
  14  VrCt                         .61         .40 
  27  SpPy                         .56         .57 
  11  TrOf                         .53         .29 
  20  NnLw                         .51         .34 
  31  CbCr                         .51         .42 
   4  AlCn                         .50         .44 
  Cluster score on D's: domain validity .97, reliability .94 

C2, taxation, T 
  24  HsBn                         .83         .71 
  25  AgBn                         .82         .72 
  22  ExBn                         .80         .67 
  23  RcBn                         .77         .67 
   3  ScBn                         .73         .53 
   2  VtBn                         .66         .44 
  21  Chrt                         .64         .60 
  26  BlJb                         .60         .38 
   8  LnTt                         .59         .71 
  17  GvWt                         .58         .50 
  19  AlnP                         .56         .46 
  12  DsVt                         .46         .23 
   6  Vsl1                         .45         .22 
   7  LgPy                         .36         .22 
  Cluster score on D's: domain validity .94, reliability .88 

C3, ethnic-religious, E 
  16  WlEx                         .87         .77 
  10  ChEx                         .81         .70 
  15  CoEx                         .78         .66 
  Cluster score on D's: domain validity .94, reliability .87 

Further column (heading illegible), values in printed order: 
  C1:  .73  .68  .56  .55  .47  .45  .45 
  C2:  .64  .63  .62  .59  .56  .51  .50  .46  .45  .44  .43  .35  .34  .28 
  C3:  .72  .67  .65 

Reliability, values in printed order: 

        Single    Cumulative 
  C1:    .96       .96(1) 
         .95       .96(3) 
         .95       .96(2) 
         .93       .96(5) 
         .93       .95(6) 
         .94       .96(4) 
         .92       .95(10) 
         .92       .95(9) 
         .93       .95(8) 
         .93       .95(7) 
  C2:    .90       .90(1) 
         .90       .92(3) 
         .89       .93(6) 
         .90       .91(2) 
         .89       .93(4) 
         .89       .93(5) 
         .87       .93(7) 
         .87       .93(8) 
         .87       .92(9) 
TABLE 7.8 CLUSTER AND FACTOR STRUCTURE: THE VOTING-ATTITUDE 
STUDY (OBLIQUE UNIFACTOR STRUCTURE IN CSA) (Continued) 


B. Generality of Clusters 

                                                      C1     C2     C3 
Reproducibility of mean squares of correlations      .50    .42    .31 
Reproducibility of communalities                     .48    .44    .38 


C. Original and Interdomain Correlations between Clusters 

              Original                 Interdomain (cos θ) 
          C1      C2      C3          C1      C2      C3 
  C1    (.94)    .09     .29         1.00     .09     .32 
  C2     .09    (.88)    .55          .09    1.00     .63 
  C3     .29     .55    (.87)         .32     .63    1.00 


from the almost completely independent cluster marked T on the right, 
whereas a third small cluster above labeled E is correlated with them both. 

What do these three cluster-defined dimensions of social attitudes 
mean? The circled definers of the P cluster at the left consist of DmGv, 
Democrat for governor; HsEm, hospital employee benefits; Pk, parking 
lots operated as a state business; and NdAg, needy aged aid by state 
pensions, the old California “ham and eggs” perennial ballot issue that 
goes back to the Great Depression. No profound conceptualization is 
required to recognize this as the dimension of political statism vs. rugged 
individualism, one that is commonly believed to split Democrats from 
Republicans. Note that the circled second cluster T at the right consists 
entirely of bond issues that are supported by property taxes. There seems 
to be no doubt that this dimension means a readiness to suffer taxation 
for community enterprises. The correlated cluster E above consists of three 
issues that seek approval for relief from taxation of certain religious and 
ethnic institutions, hence designated as the ethnic-religious dimension. 

The data on the inner structure of these three clusters, given in 
Table 7.8, show that all three have α reliabilities of the order .90. The politi- 
cal and taxation dimensions have the most generality, with political the 
largest. The correlations among these clusters show what is also revealed 
in the SPAN configuration, namely, that the two main clusters are quite 
independent. (This finding comes as a surprise to some Democrats who 
expect heavily Democratic neighborhoods to tend to support taxation for 
community works. It still appears, however, that nobody, whatever his 
political hue, really loves the tax collector when voting in the sanctum of 
the election booth.) The fact that the ethnic dimension is more closely 
related to the taxation dimension than to the political dimension suggests 
that perhaps a tax money pinch has a stronger adverse effect on sympathy 
for minorities than a political conviction that one ought to support them. 
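
The difference between the original and the interdomain correlations of Table 7.8, Sec. C, can be made plausible with a small calculation. The sketch below is not the CSA computation, which works from communalities; it assumes only the classical correction for attenuation, dividing each observed correlation between cluster scores by the square root of the product of the two reliabilities. The reliabilities (.94, .88, .87) and observed correlations (.09, .29, .55) are the values read from Table 7.8.

```python
import math

# Cluster-score reliabilities and observed correlations as read from Table 7.8.
reliability = {"P": 0.94, "T": 0.88, "E": 0.87}
observed = {("P", "T"): 0.09, ("P", "E"): 0.29, ("T", "E"): 0.55}


def disattenuate(r: float, rel_a: float, rel_b: float) -> float:
    """Classical correction for attenuation (an assumption, not the CSA formula)."""
    return r / math.sqrt(rel_a * rel_b)


corrected = {
    pair: disattenuate(r, reliability[pair[0]], reliability[pair[1]])
    for pair, r in observed.items()
}

for (a, b), value in corrected.items():
    print(f"{a}-{b}: observed {observed[(a, b)]:.2f}, corrected {value:.3f}")
```

The corrected values come out close to the printed interdomain correlations of .09, .32, and .63, which suggests that the interdomain figures behave much like correlations freed from the unreliability of the cluster scores.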

How this empirical analysis, followed by revised factoring, produces 
the above findings is seen in the details of Table 7.9. At the bottom of the 
Empirical Factoring columns the third and fourth dimensions are shown 
to contribute only a small common variance as orthogonal dimensions. 
The definers of the fourth dimension do not lend themselves to ready 
interpretation. In the revised factoring the fourth dimension was dropped. 
The deletion and addition of definers to the first and second dimensions, as 
shown in the column under Definers, was largely in the interest of increas- 
ing the independence of the two sets of definers or in increasing prima facie 
meaning. The three-dimensional solution under Revised Factoring gave 
about the same fit as the initial empirical four, and though the residuals 
are increased a little, they are quite trivial. 

The general conclusion from this analysis need not be labored. Small 
neighborhoods vote mainly in terms of two general attitudes that differ- 
entiate them, P, political statism, and T, taxation. The attitude structure 
appears to be more sharply differentiated than the demographic. Almost 
all the 31 issues on which the people voted polarize around the two attitude 
clusters. On the other hand, many of the demographic characteristics 
form a graded series through the socioeconomic gradient. Perhaps this 
comparative result is to be expected. Attitudes are organized conceptions 
in the minds of people supported by strong emotions and are thus subject 
to simplifying stereotypy. Demographic characteristics, however, are not 
“mental” phenomena but the objective expression of a multitude of uncon- 
trolled social and biological forces not so easily subjected to simplified 
structuring. 

At a later point we score neighborhoods on the three voting-attitude 
clusters to find out how well variations between the neighborhoods in these 
attitudes are predictable from their three objective demographic scores on 
the socioeconomic independence, family life, and assimilation clusters. 
Since actual scores on individuals should be as prima facie meaningful as 
possible and should representatively sample a cluster domain, we change 
the definers of P, political, by deleting Pk, parking facilities by the state, and 
replacing it with the broader issue, ExFl, one that favors giving ex-felons back 
their vote after paying their debt to society. For the T, taxation, cluster, one 
bond issue, VtBn, veterans bonds, is deleted because the votes by the large 
block of veterans might be more a function of self-interest than of atti- 
tudes toward taxation. The E, ethnic, definers are left as they stand. These 
changes do not (as it happens) in any substantial way alter the statistical 
description of the structure (as computed by a rerun of CSA). 


Domain sampling theory of structure 


The term cluster analysis has one unfortunate connotation: it may imply 
that variables of a study should group themselves nicely into tight clusters; 
however, on domain sampling theory no sharp clustering is necessarily to 
be expected. The issue goes to the question of the basic causes of indi- 
vidual differences. In the domains of Holzinger abilities, for example, the 
causes of individual differences among the children are the thousands of 
genes lying in hosts of loci in many chromosomes; combinations of these 
determinants form the many varieties of genotypes that constitute the 
observed group of children. These genotypes are complexly influenced by 
innumerable varying environmental fields in which they develop and which 
they seek out. The 24 tests sample 24 different domains of these basic 
determinants, organized into five rationally designed special stimulating 
types of selected situations. As a consequence, these 24 selections produce 
the configuration in the SPAN diagrams of Figs. 7.3 and 7.4. But if different 
selections had been made, a different configuration might have been dis- 
covered. Ideally, if all varieties of abilities one could possibly think up had 
been used in the study (pity the children!), the configuration on the SPAN 
diagrams would have been a vast swarm of points grading across the 
surface of many spheres. 

Similarly, the configuration of demographic characteristics of Bay 
Area neighborhoods seen in Fig. 7.5 is that of only one selection of many 
possible demographic features. With ingenuity (and a large budget) one 
might be able to find demographic characteristics that could fill any pres- 
ently empty place in the configuration. Indeed, we find that in the socio- 
economic gradient characteristics spread across the region like a Milky 
Way. The lacunae between this galaxy and the family life cluster may be 
due simply to failure to think of or presently be interested in demographic 
features that could fill these empty places. In the voting-attitude study, 
the differentiation into the general cluster groups seen in Fig. 7.6 does not 
necessarily mean that no social attitudes exist in people that could fill the 
empty places presently evident between the three cluster groups. Issues 
that would have elicited such attitudes happened to be absent from the 
1954 ballot. 

Generally then, one would not necessarily expect to discover sharply 
different clusters of attributes in any study, but if such a structure does 
exist, one would not necessarily impute to any of the clusters the property 
of having a primary, fundamental character. The clusterings that do 
appear result from complex selective forces at work at the time a study 
is undertaken, forces in the thinking of the investigator that cause him to 
measure some attributes and not others or social and biological forces 
currently at work generating special correlations between some attributes 
but not between others in the particular samples of subjects being studied 
(see Tryon 1935, 1940). On another occasion in time, or under the aegis 
of another investigator of a different theoretical bent, or in another quite 
different sampling of subjects, quite a different set of forces might generate 
a different configuration. 

In contrast to this pluralistic theory of a vast number of biological 
and social domains or components being sampled selectively in any par- 
ticular study there is the opposing monistic theory of one “underlying 
factor” with a few secondary special components (Spearman, 1927) or the 
neomonistic theory of a few general factors (Thurstone, 1935). The latter 
doctrine reached its most sophisticated form in the belief in a “simple 
structure” (Thurstone, 1947, chap. 14). There has been much confusion 
in efforts to understand this vague doctrine of simplicité. It should, how- 
ever, be clear if its meaning is considered in terms of the configurations 
on SPAN diagrams (also see the relatively clear statement in Harman, 
1967, pp. 112ff). 

The simple structure theory is as follows. If variables were proper 
measures, they would be of “complexity 1,” i.e., each would be a “pure” 
measure of one factor only and not of more than one. Such pure tests 
would therefore have a clear unifactorial structure and lie in tight clusters 
at corners of spherical triangles. The primary dimensions or axes at the 
corners might be orthogonal, whence the configuration would be termed an 
“orthogonal simple structure.” But they are more likely to be correlated, 
in which case they would constitute an “oblique simple structure.” With 
actually observed attributes of people, it seems impossible to find pure 
variables. One must therefore settle for the expectation that variables 
should be of complexity 2, i.e., lie in planes along the arcs connecting 
corners of the configuration, or be of greater complexity, i.e., lying in 
hyperplanes. However, variables should properly all be 
encompassed within the triangular figure, in which case their factor coeffi- 
cients on the “primary” dimensions at the corners are all positive, a shape 
of things called a “positive manifold.” Many efforts have been made to 
spell out the analytic conditions by which such a simple structure can be 
computationally identified (see Harman, 1967, pp. 290ff), but there does not 
appear yet to be any agreement on the matter. The belief in a parsimonious 
set of underlying factors that would generate a simple structure appears 
to be unsupported by any of the basic biological or social sciences. 
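
The notions of complexity and positive manifold described above can be stated operationally. The following sketch is an illustration only: it counts, for each variable, the primary dimensions on which its coefficient is salient, and checks whether all coefficients are non-negative within a tolerance. The salience threshold of .30, the tolerance of -.10, and the coefficient matrix are assumptions made for the example; they are not criteria given in this chapter.

```python
import numpy as np


def complexity(coefficients: np.ndarray, threshold: float = 0.30) -> np.ndarray:
    """Number of dimensions on which each variable has a salient coefficient."""
    return (np.abs(coefficients) >= threshold).sum(axis=1)


def is_positive_manifold(coefficients: np.ndarray, tol: float = -0.10) -> bool:
    """True when no coefficient falls below the (slightly negative) tolerance."""
    return bool((coefficients >= tol).all())


# Hypothetical coefficients of four variables on three primary dimensions.
L = np.array([
    [0.8, 0.0, 0.1],   # "pure" variable at a corner: complexity 1
    [0.1, 0.7, 0.0],   # complexity 1
    [0.5, 0.5, 0.1],   # on the arc between two corners: complexity 2
    [0.4, 0.4, 0.4],   # inside the triangle: complexity 3
])

print(complexity(L).tolist())
print(is_positive_manifold(L))
```

A variable with a clearly negative coefficient on any primary dimension would lie outside the triangular figure, and the second function would then return False.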

A few special points are now made here on aspects of structure 
analysis which were not treated above, because they would have encum- 
bered the flow of thought. 


The clustered correlation matrix 


From knowledge of the oblique structure one can arrange the original 
correlations among the variables so that the correlation matrix itself 
reveals the cluster structure. Such a clustered correlation matrix presents 
a meaningfully organized set of data which would otherwise be a chaotic 
mass of numbers. 

In the clustered correlation matrix the variables along the borders 
are arranged by clusters. The order of the clusters themselves can usually 
be meaningfully arrayed in accordance with their structural interrelation- 
ships (as revealed in CSA and SPAN). Often the cluster order, however, 
is arbitrary. The actual process of clustering the matrix is performed in 
the BC TRY System by one quick pass of a special program called REDE 
(which means REorder, DElete, or reflect rows and columns of a matrix). 
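
The reordering step itself is simple to illustrate. The sketch below is not the REDE program; it is a minimal stand-in, with hypothetical function and variable names, showing how the rows and columns of a correlation matrix can be permuted so that the variables of each cluster stand together.

```python
import numpy as np


def cluster_matrix(R, names, cluster_of, cluster_order):
    """Permute rows and columns of R so that variables are grouped by cluster.

    cluster_of maps each variable name to its cluster label; cluster_order
    gives the sequence in which the clusters should appear.
    """
    order = [
        i
        for cluster in cluster_order
        for i, name in enumerate(names)
        if cluster_of[name] == cluster
    ]
    return R[np.ix_(order, order)], [names[i] for i in order]


# Hypothetical correlations among four variables belonging to two clusters.
names = ["a", "b", "c", "d"]
R = np.array([
    [1.0, 0.1, 0.7, 0.2],
    [0.1, 1.0, 0.2, 0.6],
    [0.7, 0.2, 1.0, 0.1],
    [0.2, 0.6, 0.1, 1.0],
])
cluster_of = {"a": "X", "b": "Y", "c": "X", "d": "Y"}

R_clustered, names_clustered = cluster_matrix(R, names, cluster_of, ["X", "Y"])
print(names_clustered)
print(R_clustered)
```

After reordering, the high correlations fall into square submatrices along the principal diagonal, and the low cross-correlations fall into the off-diagonal blocks.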

To demonstrate how meaningful the clustered matrix is, Table 7.10 
shows the clustered matrices of the three studies with which we have been 
dealing. In the abilities study, Sec. A, the ordering of the variables comes 
from our final decision on the definers of the V, S, F, and M clusters and 
on the configuration of their relationships (as revealed in the SPAN con- 
figurations of Figs. 7.3 and 7.4). As a result, the submatrices along the 
principal diagonal show relatively elevated correlations among the definers 
of clusters. The definers of each cluster are highly collinear, as described 
and illustrated in Chap. 4 (see Figs. 4.1 to 4.3). The relations between 
clusters are also clearly revealed in the cross-correlation submatrices 
sectioned off in the off-diagonal submatrices. 

We have included all the variables of the three studies in their clus- 
tered matrices in Secs. A to C of Table 7.10 in order to have in one place 
the full set of data on which the various analyses given in this book are 
based. We do not discuss these three matrices in detail but summarize 
below the salient advantages of the clustered matrix, which can be con- 
firmed by a study of the three matrices in Table 7.10. 

The clustered correlation matrix has the following advantages: 


1 The final results appear in an intelligibly organized correlation matrix, not 
difficult even for a neophyte to understand. The format is one in which the correla- 
tions or selections from them can be published effectively or otherwise presented. 

2 The matrix is itself divorced from the derivative factoring process and from 
the complications of geometric and statistical oblique structure analysis, procedures 
that may be difficult for some people to understand. 

3 The clustered matrix includes observed correlations and thus represents 
the reality of correlated variations among the variables, whereas the configurations 
from SPAN refer to theoretical augmented common factor correlations. For example, 
in the abilities study, though the correlation matrix tells the same relative story as 
SPAN, it shows plainly that there is in fact considerable specificity in the observed 
variations in the abilities; except for the verbal cluster, even the correlations among 
the definers are rather low. 

4 The meaning and inner cluster structure of each of the clusters emerges 
readily from a study of the submatrix of defining variables near the diagonal and 
from their relations to other cluster definers as given in the cross-correlation sub- 
matrices away from the diagonal. 

5 If one is interested in an illustration of the derivation of the various formulas 
used in oblique cluster structure analysis, the clustered table with communalities in 
the diagonals organizes the coefficients in precisely the form used in these formulas, 
all of which are variants of the “correlation of sums” formula (Guilford, 1950, pp. 
586ff). For example, the domain validity and α reliability of a cluster score come only 
from its submatrix of definers on the diagonal. The common factor (interdomain) 
correlation between two clusters requires, in addition to their two diagonal sub- 
matrices, only their off-diagonal cross-correlation submatrix; the real correlation 
between the cluster score involves exactly the same data but with unities in the 
diagonals. The oblique factor coefficients of the n variables on a given cluster domain 
involve only the columns of correlations of its definers, as grouped together in the 
matrix. The index of generality (reproducibility) of a cluster requires these same 
columns and, in addition, either the total of the squares of correlations over the 
whole table or the sum of communalities across all n variables. Chapter 12 presents 
the formulas necessary to calculate each of these statistics. 
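
The statement in point 5 that these statistics are all variants of the correlation of sums can be sketched numerically. The code below is an illustration under assumptions, not a transcription of the Chapter 12 formulas: it takes the α reliability in its standardized form, treats the raw correlation between two cluster scores as a correlation of sums with unities in the diagonal, obtains the interdomain correlation from the same sums with communalities in the diagonal, and takes domain validity as the correlation between the observed sum and the sum of common parts. The matrix, communalities, and function names are hypothetical.

```python
import numpy as np


def block_sum(M, rows, cols):
    """Sum of the entries of M in the submatrix rows x cols."""
    return M[np.ix_(rows, cols)].sum()


def with_diagonal(R, diag):
    """Copy of R with the diagonal replaced by diag (scalar or vector)."""
    M = R.copy()
    np.fill_diagonal(M, diag)
    return M


def alpha_reliability(R, a):
    """Standardized alpha of the sum of the definers a."""
    k = len(a)
    return k / (k - 1) * (1 - k / block_sum(with_diagonal(R, 1.0), a, a))


def raw_score_correlation(R, a, b):
    """Correlation of sums with unities in the diagonal."""
    R1 = with_diagonal(R, 1.0)
    return block_sum(R1, a, b) / np.sqrt(block_sum(R1, a, a) * block_sum(R1, b, b))


def interdomain_correlation(R, h2, a, b):
    """Same correlation of sums, but with communalities in the diagonal."""
    Rh = with_diagonal(R, h2)
    return block_sum(Rh, a, b) / np.sqrt(block_sum(Rh, a, a) * block_sum(Rh, b, b))


def domain_validity(R, h2, a):
    """Correlation of the observed cluster score with its common-factor domain."""
    return np.sqrt(
        block_sum(with_diagonal(R, h2), a, a) / block_sum(with_diagonal(R, 1.0), a, a)
    )


# Hypothetical clustered matrix: definers 0, 1 form one cluster; 2, 3 another.
R = np.array([
    [1.0, 0.6, 0.3, 0.2],
    [0.6, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.5],
    [0.2, 0.3, 0.5, 1.0],
])
h2 = np.array([0.6, 0.6, 0.5, 0.5])
A, B = [0, 1], [2, 3]

print(round(alpha_reliability(R, A), 3))
print(round(raw_score_correlation(R, A, B), 3))
print(round(interdomain_correlation(R, h2, A, B), 3))
print(round(domain_validity(R, h2, A), 3))
```

Each function reads only the submatrices named in point 5: the diagonal submatrix of definers for reliability and domain validity, and the two diagonal submatrices plus the cross-correlation submatrix for the correlations between clusters.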


Physical model of the spherical configuration 


Investigators who have difficulty seeing the SPAN configuration in three 
dimensions usually find the matter solved when the configuration is trans- 
ferred to a real globe or balloon. The printout of any SPAN set gives the 
coordinates by which each point on the globe can be quickly located by 
triangulation. 

Relation between two different indexes of collinearity, P² and cos θ 


We use two different indexes of collinearity in separate places in objec- 
tively locating clusters having an inner structure of high collinearity in the 
full cycle analysis. Selecting the most collinear subset of variables by the 
mutual collinearity criterion during key-cluster factoring of the first dimen- 
sion involves the use of P² as the index of collinearity in the correlation 
matrix. However, in gauging the collinearity of variables in structure analy- 
sis, we use the nearness of variables as points on the sphere. Nearness 
is represented by the arcs between variables, and the arcs in their turn 
are determined by the central angles between them, more specifically 
by the cosines of the central angles. These cosines are the common factor 
correlations between the variables. 

The problem is this: Do these two indexes tell the same story? We 
find out simply enough by comparing them in each of the three illustrative 
studies, as follows. In the Holzinger study, for example, for each variable 
we list its P² values with all the other 23 variables of the study, an option 
in DVP41. This series represents the collinearity of a variable's pattern of 
correlations with the 23 patterns of the correlations of the other variables. 
Beside this listing we write the corresponding values of the cos θ of the 
variable with the other 23 as computed in the CSA component: these 
are the cos θ values in the geometric model of SPAN (computed in CSA by 
opting the correlation matrix to be reproduced from the unaugmented 
factor coefficients, from which the SPAN figures are produced, and then 
calling for the augmented correlation matrix, the cos θ values, to be cal- 
culated on the reproduced matrix). 
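
The comparison just described can be mimicked on a small scale. The sketch below rests on two assumptions that this section does not spell out: that P² between two variables is the squared index of proportionality of their columns of correlations, taken here over full columns with communalities in the diagonal, and that cos θ is obtained by extending each variable's row of factor coefficients to unit length. The coefficient matrix is hypothetical, and NumPy's `corrcoef` stands in for the RSCAT scattergram correlation.

```python
import numpy as np


def p_squared(R: np.ndarray) -> np.ndarray:
    """Squared proportionality index between every pair of columns of R."""
    cross = R.T @ R
    norms = np.diag(cross)
    return cross ** 2 / np.outer(norms, norms)


def cos_theta(coefficients: np.ndarray) -> np.ndarray:
    """Cosines of the central angles between variables on the unit sphere."""
    unit = coefficients / np.linalg.norm(coefficients, axis=1, keepdims=True)
    return unit @ unit.T


# Hypothetical factor coefficients: two clusters of three variables each.
F = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.1],
    [0.1, 0.7],
    [0.2, 0.8],
    [0.1, 0.6],
])
R = F @ F.T  # correlations reproduced from the coefficients

P2 = p_squared(R)
C = cos_theta(F)

# Compare the two indexes over all distinct pairs of variables.
upper = np.triu_indices(len(F), k=1)
agreement = np.corrcoef(P2[upper], C[upper])[0, 1]
print(round(float(agreement), 2))
```

In this example both indexes are near unity for pairs within a cluster and much lower for pairs drawn from different clusters, so the two listings rank the pairs in the same way.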

With the two separate sets of indexes listed (key-punched) side by 
side, we input them as two variables into RSCAT in order to determine the 
relation between them. The results are shown in Fig. 7.7 for the Holzinger 
study (top) and for the social-area and voting-attitude studies (bottom). 
In all three studies there is an obvious very high relation between the two 
indexes, the correlations between them being respectively .92, .90, and .94. 
Since P² (written PSQ in RSCAT and in Fig. 7.7) is computed from the origi- 
nal correlations whereas cos θ comes from the reproduced correlations, 
which deviate from the original values by the residual values, we cannot 
expect the scattergram correlations to reach unity unless all the residuals 
are exactly zero. The values found are probably smaller than the values 
possible, considering that the residual errors in the cos θ values are not 
exactly zero. 


The conclusion is that when one looks at a SPAN configuration, a 
representation of cos θ values, one is looking at collinearity relations among 
the variables which, relatively speaking, are almost precisely those in the 
correlation matrix, as indexed by the P² values. They even match in abso- 
lute magnitudes for high values: note that when either is above .90 so is 
the other in most cases. In many studies the definers of oblique dimensions 
usually have P² values with each other of the order at least .90. In Table 4.2, 
for example, see the diagonal average P² values of the definers of clusters in 
the Holzinger study. 


FIGURE 7.7 

In three studies the relationship between P² and cos θ as separate indexes 
of the degree of collinearity of the variables with each other. 


Structure analysis by alternative methods 


This section is concerned with the comparison of some of the main methods 
of factor analysis in describing the structure (“organization” or “configura- 
tion”) of the relations among variables of a problem. 


Structure from key-cluster factoring 


As an anchor analysis, we use the configuration of the 24 Holzinger abilities 
described by key-cluster factoring and displayed on the SPAN diagram, 
shown in Fig. 7.8, upper left sphere. The configuration there is virtually 
identical with that given earlier in Fig. 7.3, deviating from it only by virtue 
of minor revisions of the definers of the speed and memory clusters. 
Despite the revision, the configuration shown on the surface of the sphere 
has remained relatively invariant. 

Circled in broken lines are the defining variables of the dimensions. 
The orthogonal dimensions themselves are, of course, 90° apart, as shown 
by the symbols C1, C2, and C3 in boxes at the three vertexes of the large 
spherical triangle drawn in long broken lines. Recall that four dimensions 
are required in the Holzinger problem. We avoid presenting a second 
sphere, however, that includes the fourth dimension (as in Fig. 7.4) by 
showing those variables which lie in the fourth dimension also on the sur- 
face of this single sphere, designating their loci as X's, meaning that they 
project into a fourth dimension, C4, and are therefore not shown in this 
three-dimensional figure. The six tests that lie in the fourth dimension are 
memory tests (M), shown in the center of the figure. 


Orthogonal factors vs. oblique factors 


In order to compare structure analyses performed by different methods, 
we must define or recall several important terms common to all such 
analyses. Orthogonal factors are the independent dimensions or axes 
derived from residuals by the factorization processes. These points are 
termini of the three orthogonal axes that pass through the origin of the 
sphere (see Fig. 7.2). Oblique factors are correlated dimensions or axes 
whose termini are not orthogonal but are, for example, at the centroids 
of the definers (circled by short broken lines in the figure) represented in 
the figure by the circled points. They are oblique because the scores of 
children on these three clusters would correlate positively. The values of
the correlations between them are the "interdomain correlations" computed
by the CSA component of the BC TRY System. Note that the three
oblique factors are connected by dotted lines, forming an oblique spherical
triangle that describes the oblique factor structure in this problem.
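
As a rough illustration of how a correlation between two oblique dimensions can arise, the following Python sketch computes the correlation between two unit-weighted cluster composites directly from a correlation matrix. The matrix, the choice of definers, and the function name `composite_correlation` are hypothetical, and unities are assumed in the diagonal; the CSA component may well compute its interdomain correlations by a different estimator.

```python
import math


def composite_correlation(R, cluster_a, cluster_b):
    """Correlation between two unit-weighted sums of standardized variables.

    R is a full correlation matrix (unities in the diagonal, an assumption);
    cluster_a and cluster_b are lists of variable indexes (the definers).
    """
    cross = sum(R[i][j] for i in cluster_a for j in cluster_b)
    var_a = sum(R[i][j] for i in cluster_a for j in cluster_a)
    var_b = sum(R[i][j] for i in cluster_b for j in cluster_b)
    return cross / math.sqrt(var_a * var_b)


# Hypothetical correlations: variables 0 and 1 define one cluster,
# variables 2 and 3 define another.
R = [
    [1.0, 0.8, 0.3, 0.3],
    [0.8, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.6],
    [0.3, 0.3, 0.6, 1.0],
]

r_ab = composite_correlation(R, [0, 1], [2, 3])
# The angle between the two oblique axes whose cosine is r_ab.
angle = math.degrees(math.acos(r_ab))
print(round(r_ab, 3), round(angle, 1))
```

A positive value of `r_ab` corresponds to an angle of less than 90° between the two cluster centroids, which is what makes the dimensions oblique rather than orthogonal.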

The larger, outer spherical triangle connecting the termini of the 
orthogonal axes forms the orthogonal factor structure. The orthogonal 
structure is not a very meaningful one because the second and third
factors, C₂ and C₃, are outside the configuration, and are therefore not
directly defined by actual tests, whereas the oblique factor structure is 
meaningful because all three oblique factors are well defined by actual 
tests, namely, by the definers of the three oblique clusters. 

The important concept to stress here is the distinction between 
factors and configuration (or structure). The configuration is the general 
organization of the variables as represented by the structural swarm of 
points; factors are any loci in the configuration that one may wish to con- 
sider critically important and through which one may wish to pass axes. 
The selection of factors does not, of course, alter the configuration in
any way.


Rotation of factors


It is important to stress that the orthogonal factors in this sphere do not 
pass through tests whereas the oblique factors are in the middle of highly 
collinear clusters. These two oblique factors may be thought of as "rotations"
of the orthogonal factors to the oblique positions. Thus, the C₂ cluster
centroid may be thought of as an oblique rotation of the orthogonal
axis C₂, a process symbolized by the arrow in the figure. The actual rotation
of key-cluster factors is unnecessary because discovering and locating
oblique factors is "directly" achieved during factoring by identifying the
oblique factors as the collinear subsets of variables selected by the mutual 
collinearity criterion of key-cluster factoring. But dimensions derived by 
other forms of factor analysis, such as principal-axes factoring, ordinarily 
must be rotated if they are to be meaningful, as we shall see below. 


Structure from principal-axes factoring 


Whereas a key-cluster dimension (or factor) is defined on a subset of variables
selected from all the n variables, a principal-axis dimension (or
factor) is defined on the total set of all n variables. The score of a child on
the first key-cluster factor, for example, is his composite verbal score on
the subset of only four verbal tests, V5, V6, V7, V9, whereas his score on the
first principal factor is his composite score on the total set of scores on all
24 tests of the study. The first principal axis is graphically represented in
the sphere at the upper right in Fig. 7.8 as the point P₁ in the middle of
the configuration. The sphere plotted there is a direct tracing of the SPAN
sphere defined by the first three dimensions of the principal-axes solution
computed by program FALS (Factor Analysis Least Squares) of the BC TRY
System.

A score on a principal-axis factor can be thought of as a specially
weighted sum of scores on all 24 variables. The first three such principal
axes are represented in the Holzinger problem by the points P₁, P₂, and P₃ at
the vertexes of the large spherical triangle in
the upper right sphere of Fig. 7.8. The point P₁ lies at that place in the
configuration such that the sum of squared deviations of all 24 points
from it is a minimum. The least-squares weights attached to each of the
24 standard scores are such as to place P₁ at the dead center of the total
configuration (including the fourth dimension). A point in any other
place would have a larger sum of its 24 squared deviations. One computational
procedure that locates the position of P₁ and computes the factor
coefficients of all 24 tests on P₁ starts its solution by setting the locus of
P₁ by "trial values," then "waggles" the trial point, as it were, in such a
systematic iterative fashion that at each successive position its sum of
squares of 24 deviations is less than on the previous trials. The iterative
process stops whenever the point P₁ settles down on a given locus (within
an arbitrary convergence criterion).

The second principal axis P₂ is at right angles to the first, as shown
in the figure. Trial values of P₂ can be thought of as the termini of spokes
of a wheel the axle of which goes through P₁ and the origin of the sphere.
The finally selected locus of P₂ is at the end of that spoke from which the
sum of squared deviations of all 24 test points is a minimum. In locating
the third principal axis, a point is selected from all possible points that are
90° from both P₁ and P₂; there is only one such point for which the sum
of squares of the 24 deviations is a minimum. That point is P₃. And so the
process continues until factoring is terminated.
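
The iterative search just described can be imitated in a few lines of Python. The sketch below is not the FALS program; it substitutes ordinary power iteration on a small hypothetical correlation matrix with unities in the diagonal (both assumptions), starting from arbitrary trial values and stopping when the trial point settles down. The second axis is obtained by repeating the search on the residuals left after the first axis is removed, which keeps it at right angles to the first. The names `principal_axis` and `remove_axis` are illustrative only.

```python
import math


def principal_axis(R, tol=1e-10, max_iter=1000):
    """Return (sum of squared loadings, loadings) for the leading axis of R."""
    n = len(R)
    v = [1.0 / (i + 1) for i in range(n)]            # arbitrary trial values
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(max_iter):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        shift = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if shift < tol:                               # trial point has settled
            break
    eigenvalue = sum(v[i] * R[i][j] * v[j] for i in range(n) for j in range(n))
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings


def remove_axis(R, loadings):
    """Residual matrix after the variance on one axis is taken out."""
    n = len(R)
    return [[R[i][j] - loadings[i] * loadings[j] for j in range(n)]
            for i in range(n)]


# Hypothetical correlations among four variables forming two clusters.
R = [
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
]

ev1, p1 = principal_axis(R)                    # first principal axis
ev2, p2 = principal_axis(remove_axis(R, p1))   # second axis, from residuals
print([round(x, 3) for x in p1])
print([round(x, 3) for x in p2])
```

In this small example every variable has the same coefficient on the first axis, while the second axis merely contrasts one cluster with the other; neither axis passes through a cluster.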

Though this solution for principal factors may sound elegant mathe- 
matically (Harman, 1967), the meaning of the principal factors, such as 
P₁, P₂, and P₃, is usually obscure. Thus, in this Holzinger configuration,
though a total-set score on the first principal axis P₁ can be meaningfully
interpreted as "general intelligence," the meanings of P₂ and P₃ are quite
ambiguous. The reason is that there are no actual tests with high projec- 
tions (factor coefficients) upon them, so that we do not know what such 
factors can mean rationally. For this reason factor analysts rarely estimate 
such meaningless scores of subjects on principal factors after the first. 
They “rotate” them first, as in the quartimax and varimax solutions 
discussed below. 


The configuration from principal axes 


Though the principal axes may be rationally meaningless as dimensions, 
the configuration of the 24 tests plotted from the factor coefficients on the 
principal-axis factors beautifully portrays a structure that is virtually identi- 
cal with that given in the key-cluster sphere. The oblique factor structure 
on the principal factor sphere, denoted by the smaller, inner spherical 
triangle plotted in dotted lines in Fig. 7.8, is in essence the same as that 
directly given without rotation in the key-cluster solution shown in the 
sphere at upper left. In short, the oblique cluster structure can be derived 
from the principal-axes solution, provided, of course, one has a computer 
program like SPAN that can depict the configuration from the orthogonal 
principal factor coefficients. 

In one sense, the configuration plotted from principal-axis factors is
slightly superior to that given by the key-cluster solution. More variables
are likely to be represented on the principal-axis sphere defined by P₁,
P₂, P₃ than on the first key-cluster sphere defined by C₁, C₂, C₃. For example,
note that in the principal-axis sphere there are only three memory
tests “lost” into the fourth dimension, i.e., marked X, whereas five are 
lost into the fourth dimension of the key-cluster sphere. But superiority 
in this particular is trivial in most problems compared with the more sub- 
stantial advantages of key-cluster factoring, whose factors point in the 
general direction of oblique clusters, whereas the principal-axis factors 
point at no meaningful places in the configuration. Another disadvantage 
of the principal-axis solution is that some factors are "negative" dimen- 
sions (on which all variables may have negative factor coefficients) that 
must be reflected in order to provide a configuration that is meaningful. 


Structure from rotated principal-axis factors: 
the quartimax and varimax orthogonal factors 


One way to develop meaning of the principal-axis factors is to "rotate"
them to places in the configuration that may give them meaning. Two
such methods of rotation, computed by the BC TRY component GYRO and
plotted graphically on SPAN spheres, are the quartimax rotation (Neuhaus
and Wrigley, 1954), depicted in Fig. 7.8, lower left, and the varimax rotation
(Kaiser, 1958), given in the lower right sphere. Inspection of these two
configurations indicates that they are the same configuration; indeed, they are
exactly the configuration given on the principal axes in the upper right
sphere, in which the axes P₁, P₂, P₃ have merely been swung to the new
positions designated as Q₁, Q₂, Q₃, respectively, in the quartimax sphere,
and as V₁, V₂, V₃, respectively, in the varimax sphere. Nothing important
differs among the four spheres of Fig. 7.8. The same oblique factor structure
is depicted in them all, being denoted by the dotted inner spherical triangle
bounded by the verbal, speed, and space clusters, with the cluster
of memory tests represented as projected into a fourth dimension. The
noticeable but unimportant difference is that the orthogonal dimensions
have been put at different places in or around the configuration in the
four solutions.

The orthogonal factors Q₁, Q₂, Q₃ of the quartimax solution are fairly
close to those in the key-cluster solution. From the origin of the sphere, the
first factors in both, namely, Q₁ and C₁, point as vectors at the oblique
verbal cluster. The second, Q₂ and C₂, point at a place as close as possible
to the oblique speed cluster while still being orthogonal to the first dimensions;
similarly with the other two of the four dimensions. Generally, the
key-cluster factors are better pointers than the quartimax, for the simple
reason that in the key-cluster factoring procedure the cluster-search
routines are expressly designed to be the best pointers, whereas the
quartimax solution is a general analytic procedure not expressly designed
to locate clusters but one which nevertheless tends to do so.

The varimax factors, V₁, V₂, V₃, are worse placed and hence less
meaningfully placed than the quartimax in relation to the oblique cluster 
structure bounded by the verbal, speed, and space clusters. None of the 
varimax vectors directly points at oblique clusters. The quartimax factors 
start off excellently as pointers but do the job decreasingly well as one 
moves from the first to the second, and thence to the third, and so on. In 
short, quartimax, like key-cluster factors, orders the factors by importance 
or weight in such a fashion that the earlier, more heavily weighted quarti- 
max factors, being rotations of the more heavily weighted principal axes, 
are more meaningfully interpreted. But varimax rotation, on the other
hand, tends to emphasize the meaningfulness of the least important principal
axes as much as it does the more heavily weighted. The consequence
is that varimax represents the total configuration on its sphere less well 
than quartimax, shown by the fact that no fewer than nine (more than one-third) of the
24 tests are lost into the fourth varimax dimension, whereas only six are lost 
in the quartimax (and key-cluster) sphere, and only three in the unrotated 
principal-axis sphere. This disadvantage of varimax is somewhat matched, 
however, by the advantage that the configuration is better presented aes- 
thetically on varimax dimensions than on the other three. The symmetry of 
the oblique structure (the inner, smaller dotted triangle) to the outer ortho- 
gonal broken-line spherical triangle suggests that the rotated varimax 
factors would converge on the oblique and meaningful cluster structure 
if the orthogonal varimax factors were moved inward at the vertexes of 
the triangle. 

Conclusion


The general conclusion on structure derived by different forms of factoring
and rotation is that the orthogonal dimensions from any particular
factoring method, whether key-cluster, principal-axes, quartimax, varimax,
or any other (there are many, e.g., Thurstone centroid, diagonal, canonical,
bifactor, averoid, alpha, all available in the BC TRY System) are, as such,
not uniquely critical in describing the oblique, meaningful structure of the 
variables. These different forms of factor structure describe the same 
configuration, as is obvious in the four spherical representations of Fig. 7.8. 
Where orthogonal axes are located in the configuration is unimportant. 
But some method of factoring must be used in order to set up the
scaffolding, as it were, on which to hang the invariant configuration. The
key-cluster solution is generally preferable because its factors are designed 
to point at or near natural clusters among the variables, and, just as impor- 
tant, the program routinely leads to the program CSA which describes 
directly, and with metric precision, the nature of the oblique factor struc- 
ture as depicted graphically by the dotted triangles in Fig. 7.8. But, if an 
analyst has a predilection, say, for the principal-axis varimax rotation, he
can use it with accuracy to describe the configuration. From the resulting 
SPAN diagram he would make a decision on what should be the defining 
variables of oblique clusters. He would then use these definers to execute 
a preset key-cluster solution in order to secure a more precise description
of the oblique structure.


Bifactor structure and other forms of factor structure 


In the light of the conclusion that for a given number of dimensions that
yield trivial residuals, the configuration is the same whatever the type of
factoring employed may be, it is no longer necessary or desirable to quarrel
over which factors are "best" in some fundamental sense. Factors are
any places in the configuration that one may want to run an axis through,
and different types of factoring put axes through different places. It is
therefore seriously misleading to call any such arbitrarily placed dimension
an "underlying" factor, as if its position referred to some deep-seated
biological, social, or psychological "cause," or "source" trait. To be sure,
such a dimension may refer to such an entity, but there is absolutely no
basis inherent in the procedures of factoring to justify such an inference.
Evidence for causation must come from other sources than mere procedures
of factoring correlations.

During the first quarter of this century, psychology was wracked by
the question of whether the factoring procedure "proved" the existence
in cognitive traits of the "g factor," often referred to as "general
intelligence." The main advocate of this belief, Spearman, always dealt with
oblique configurations of abilities like those of the Holzinger problem. In
modern terms, what was being said was that one can put an axis through
the grand centroids of such configurations, like the principal axis P₁,
and call it g if one wishes. Such a positioning of an axis does not prove that
it represents an "underlying," "source" entity.

The traditional factorial solution of this sort was called, in its original
awkward formulation by Spearman, the "two-factor theory," but it was
finally generalized with some finesse by Holzinger as the "bifactor
theory" (see Harman, 1967, chaps. 7 and 8). Today, it is sometimes interesting
or convenient to perform a bifactor solution, using, say, the BC TRY
System. If, for example, one wants to remove from the correlations among 
the 24 Holzinger abilities a total-set, or total-score, dimension and study 
the structure of the correlations among them left over after the removal, 
one would call the principal-axis solution (FALS), preset it to one dimension, 
compute residuals (by FAST), and then perform a full-cycle key-cluster 
analysis of the residual matrix. This solution is a bifactor analysis in modern 
dress. 

With the MMPI items, a common theory is that one reason why the
items tend to be generally positively correlated is a desire of many respondents
to give favorable responses to them and to reject the obviously
unfavorable alternative responses. One can think of the first principal axis
as representing such a "favorability" continuum, remove it from the
intercorrelations among the items by taking first-factor residuals, and then
study the matrix of residuals for the clusterings of items "freed" of the
favorability principal axis.

These and many other factoring designs can lead to very interesting
solutions. Another specially useful one is to define dimensions by single
meaningful variables. This form of factoring is called "diagonal" factoring
or "pivot-variable" analysis (an option in CC factoring in the BC TRY System).
All these variants should be thought of not as devices that inexorably
reveal underlying truths but as various designs to extract variance of
different, defined sorts from the correlation matrix, leading to different
ways of deriving meaningful scores on the individuals who compose the
group being studied.


Chapter 8


OBJECT CLUSTER ANALYSIS 


From the time of Hippocrates, and doubtless even before, there has
been an urge to describe the salient characteristics of a person by
identifying him with a "type." Though forming typologies may sometimes
be decried, it is an important scientific pursuit. The medical profession
could not exist without typologies, the medical term being "syndromes."
In the general field of biology, a typology is called a "taxonomy"; plants
and animals could not be conceptualized were they not cast into typological
classes called families, genera, species, and so on. These groups of
organisms are not fanciful constructs; they are genetically differentiated.
Indeed, in biology the term "type" is recognized as vitally important. An
organism's configuration of genetic factors is its "genotype," the chromosome
photograph of which is its "karyotype." The basic scientific problem
is not really a theoretical dispute over whether typologies "exist." Rather,
the problem involves the more difficult and important question of developing
an objective method of forming a typology and of assigning individuals
to their proper groups within it.

There are a number of reasons why typologies are desirable: (1)
Since a particular type includes many individuals, a considerable amount
of information derived from experimental or field observations on the
individuals accumulates to the type. Any new individual who fits the type
can be better understood than if no such cumulative information were
[...]w that becomes quite [...]d the other has a very [...]tion of individuals who [...]

[...]ying many individuals into [...] attributes was forbidding before
[...] computer. Computer programs now can do the job in a
matter of minutes. In the field of psychology such programs were originally
called "inverse analysis," but now the more appropriate term is "person
cluster analysis" or simply "O-analysis." In biology, where the classification
of species and varieties [...] "numerical taxonomy" ([...], 1963). In
electrical engineering they are called programs of "pattern recognition."

[...] which the computer can "visualize" them in the higher-order space.
Conceptualized in this fashion, the problem of forming a typology of a 
group of individuals on scores on many attributes is a simple one in prin- 
ciple: one writes a computer program to represent the individuals as a 
swarm of points in a hyperspace of scores and to locate within the swarm 
regions of high density; there are as many types as there are regions of 
high density in the space of scores. In practice, the problem usually is a 
bit difficult because individuals expressed as a swarm of points often do 
not show clear-cut regions of high density. 

Figure 8.1 illustrates a computer-program-discovered typology in the 
form of 15 profile types in the Holzinger problem. The 15 different classes 
of profiles in the figure were actually discovered by the BC TRY program 
OTYPE, which located the 15 types as 15 areas of density in a score space
of four dimensions, these being the four cluster scores on V (verbal),
S (speed), F (form or space), and M (memory). Each of the four dimension
scores was expressed in standard score form with a mean of 50 and a
standard deviation of 10, as shown in the form of vertical axes in Fig. 8.1.
The "unique individuals" at the bottom of Fig. 8.1 had profiles too deviant
to fit any of the 15 types shown above them.

FIGURE 8.1
Profiles of 301 children in 15 core O-types in the Holzinger problem.

This illustration points up some of the difficulties in O-analysis. The 
"real" configuration of the profiles is the swarm of points of the individuals 
in hyperspace. One can, of course, actually see such a swarm of points 
and pick up areas of density in it if there are only two dimensions. When 
there are three dimensions, one would observe the swarm within a sphere 
or cube, as in a room. In four or more dimensions there is no easy way to 
"see" the areas of density. Obviously, some profile types are more like 
each other than they are like others. In any actual problem it may be possi- 
ble or necessary to combine some types into higher order clusters. To do 
so, it is necessary to observe the relationship among the different profile
types and combine those which are most similar. In the Holzinger problem
the relations among the 15 types are revealed in Fig. 8.2, where the 15
different types are described by the 15 centroids, shown in the spherical
configuration. Each of the 15 different centroids is expressed by a point
on the surface of a sphere, one of the familiar SPAN spheres described earlier
in the key-cluster solution of variables in Chap. 7. The configuration of
these centroids in relation to the four score axes also drawn on the spherical
surface shows that types 1 and 2 are rather similar, being close together
as points. These two types are both "lows" in most of the four dimensions,
i.e., below the mean. By cross-referencing similarities and differences in
Figs. 8.1 and 8.2 it can be discovered that these two methods are merely
alternative ways of describing the total configuration. Profiles that have
opposite shapes, i.e., are mirror images, lie at opposite sides of the sphere,
e.g., types 1 and 15.

FIGURE 8.2
Spherical structure of the 15 core O-types of the Holzinger problem.

The advantage of expressing the configuration of types in the spherical
representation of Fig. 8.2 is that it can be visualized, whereas it cannot
easily be seen in the profiles of Fig. 8.1. Though there are actually four
dimensions, we represent the overall structure in the two-dimensional
plane of a piece of paper. In many problems typologies in many dimensions
can be represented on only a few subsets of spheres like the one in
Fig. 8.2.
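
The geometric idea that similar profiles lie close together on a sphere, and mirror-image profiles at opposite sides, can be sketched in Python as follows. This is not the SPAN projection itself, whose details are not given here; it simply assumes that each profile is expressed as deviations from the standard-score mean of 50 and scaled to unit length, so that the cosine between two such points measures their closeness. The three profiles are hypothetical, not the Holzinger types.

```python
import math


def to_sphere(profile, mean=50.0):
    """Express a profile as a unit-length vector of deviations from the mean."""
    dev = [x - mean for x in profile]
    length = math.sqrt(sum(d * d for d in dev))
    if length == 0.0:
        raise ValueError("a perfectly flat profile at the mean has no direction")
    return [d / length for d in dev]


def cosine(profile_a, profile_b):
    """Cosine of the angle between two profiles placed on the unit sphere."""
    a = to_sphere(profile_a)
    b = to_sphere(profile_b)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical four-dimensional profiles (V, S, F, M) in standard-score form.
low = [40, 42, 41, 43]        # below the mean on all four dimensions
similar = [42, 43, 40, 44]    # another "low" profile of similar shape
mirror = [60, 58, 59, 57]     # mirror image of the first profile

print(round(cosine(low, similar), 3))   # near +1: neighbors on the sphere
print(round(cosine(low, mirror), 3))    # -1: opposite sides of the sphere
```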

Still another way to describe the relationships among the profiles
of the individuals or of their centroids is in the form of a "hierarchical
chart" showing the hierarchical structure of the O-types. We turn to this
matter a little later, but in the meantime Fig. 8.6 shows the hierarchical
chart of the 15 core O-types of the Holzinger problem.

To sum up the methodological problem, there are four different ways 
to describe the typology of individuals: (1) types can be described as 
centers of density in score space (Fig. 2.1), (2) they can be described as
profile charts (Fig. 8.1), (3) they can be described by configurations on
the surface of a hypersphere (Fig. 8.2), and (4) the typological structure
can be described summarily by a hierarchical chart (Fig. 8.6). All four ways
have their particular advantages and disadvantages.


Methods in object cluster analysis


Deciding on the score space


Since object clusters are formed on the basis of profiles of scores of individuals
on a given set of variables or dimensions, the first major step is to
decide upon the dimensions of the score space. They can be the n raw
scores of all the variables of a study. This form of clustering, sometimes
called Q-analysis, is usually, however, an undesirable undertaking. Using
single variables as dimensions denies one the power and information 
derived from a V-analysis of all the n variables. Except in those studies 
where each specific variable may have an important meaning in its own 
right and should not be included in a composite derived from a V-analysis, 
the dimensions of a typology should be the cluster or factor scores derived 
from a prior V-analysis. A central reason for doing so is to secure the gain 
in generality that such dimensions necessarily possess over individual 
variables. A greater degree of generality in typological prediction can be 
expected where a typology is based on cluster-composited dimensions. 

There are different forms of V-analysis, namely, key-cluster, principal- 
axes, and sundry rotations of these. When oblique dimensions are chosen 
as the dimensions of the score space, the form of factor analysis that was 
used in the V-analysis makes little difference, for, as we showed in Fig. 7.8, 
the oblique structure of the variables is invariant with respect to the type 
of factoring procedure employed. 

Once a decision is made on the dimensions, the individuals are scored 
on the different dimensions. In the BC TRY System, this component is 
FACS (Factor And Cluster Scores), which is capable of computing dimen- 
sion scores on individuals in any form one wishes. These can be oblique 
scores on the dimensions derived by the prior V-analysis or orthogonal 
scores on such rotated principal axes as varimax or quartimax. Whatever 
the type of dimension scores desired, it is usually advisable to convert the 
scores on each dimension to standard score form, with a mean of 50 and 
standard deviation of 10. This conversion means that all dimensions have 
equal weight in determining the typology. If, however, it is desired to give 
greater weight to some dimensions than to others, a scoring component 
such as FACS permits one to weight the dimensions differently. One may 
wish, for example, to weight the dimensions that have the greatest gen- 
erality more heavily than those which are more specific. The more general 
dimensions therefore stretch out the individuals more in score space 
than the less general dimensions. 
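
The conversion to standard score form is simple enough to sketch in Python. The function below is not the FACS component; its name, the use of the population standard deviation, and the `weight` parameter (which merely multiplies the spread of a dimension) are assumptions made for illustration, and the raw scores are hypothetical.

```python
import math


def standard_scores(raw, mean=50.0, sd=10.0, weight=1.0):
    """Rescale raw scores to the given mean and standard deviation.

    weight > 1 stretches the individuals out more on this dimension,
    giving it more influence on distances in score space (an assumed
    way of weighting, not necessarily the one used by FACS).
    """
    n = len(raw)
    m = sum(raw) / n
    s = math.sqrt(sum((x - m) ** 2 for x in raw) / n)   # population SD
    return [mean + weight * sd * (x - m) / s for x in raw]


# Hypothetical raw scores of eight individuals on one dimension.
raw = [2, 4, 4, 4, 5, 5, 7, 9]

equal = standard_scores(raw)                 # mean 50, SD 10
heavier = standard_scores(raw, weight=1.5)   # mean 50, SD 15
print(equal)
print(heavier)
```

With equal weights every dimension has the same spread, so no single dimension dominates the distances among individuals; a weight above 1 deliberately lets a more general dimension count for more.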


The problem of missing data 


Many studies are crippled by the problem of missing data, i.e., by some 
individuals having no scores on some variables. This missing-data problem 
may not be serious in V-analysis because, in preparing the correlation 
matrix for factoring, the pairwise correlations between variables can be 
computed only for those individuals which have complete scores (as in the 
BC TRY correlation program COR3). But in computing cluster or factor 
scores on each dimension, if there is a missing score on any single variable
for a given individual, that individual may be excluded from the entire
typological analysis, because absence of a score on any variable prevents
his profile or his locus in score space from being computed. There is one
exception to this principle: when cluster scores in contrast to factor scores
are being used as measures of the dimensions. Recall that a cluster score
on any dimension is a composite of the standard scores on a collinear subset
of defining variables [Eq. (2.2)]. These definers of a dimension sample
the same general domain of variation, are positively correlated, and therefore
can replace each other if any one of them is missing. Thus, in the
FACS program of the BC TRY System, if the standard score on any definer
of a given dimension is missing, it is replaced routinely with the mean
standard score on the other definers whose scores are present. Since
it is unusual for scores on all defining variables of a given dimension to
be missing, when dimensions are defined by cluster scores, it is rare for
any individual to be missing from the typological analysis. When a score is
missing on any variable that does not measure a cluster-defined dimension,
it is of course of no consequence.

The problem is serious when the V-analysis is performed by any of
the orthodox methods of factor analysis such as the principal-axis solution
and any orthogonal or oblique rotation of its dimensions, for factor estimates
of these dimensions are based on a least-squares weighting of
scores on all n variables. Hence if any of the n scores is missing for a given
individual, his factor scores cannot in principle be computed.
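
The contrast between the two kinds of scores can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the FACS routine: the composite is taken here as the simple mean of the definers' standard scores (the exact form of Eq. (2.2) is not reproduced), missing scores are represented by `None`, and the function names and numerical values are hypothetical.

```python
def cluster_score(definer_scores):
    """Composite of standard scores on the definers of one dimension.

    A missing score (None) is replaced by the mean of the definers that
    are present. Returns None only if every definer is missing.
    """
    present = [s for s in definer_scores if s is not None]
    if not present:
        return None
    fill = sum(present) / len(present)
    filled = [fill if s is None else s for s in definer_scores]
    return sum(filled) / len(filled)


def factor_score(weights, scores):
    """Weighted sum over all n variables; one missing score blocks it."""
    if any(s is None for s in scores):
        return None
    return sum(w * s for w, s in zip(weights, scores))


# One definer out of four is missing: the cluster score is still computable.
print(cluster_score([0.5, None, 1.1, 0.8]))

# One variable out of three is missing: the factor score is not.
print(factor_score([0.2, 0.3, 0.5], [1.0, None, 0.5]))
```

Because the substituted value is the mean of the definers present, the composite formed this way equals the mean of the available definers.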


Typology from the overall configuration
in spherical analysis


The best way to form a typology is to do so from a study of the entire
configuration of individuals, in which one can see regions of density in the
score space if there are any. When there are more than two dimensions,
it is not possible to see this configuration in the score space proper. One
must therefore represent the configuration in some other fashion. The
device for doing so has already been presented in V-analysis by the general
procedures of projecting many observations as points on the surface of a
hypersphere of k dimensions. In the Holzinger problem, for example, the
configuration is presented in Fig. 8.2 but only as points representing the
centers of density, or centroids, of individuals.

Since this form of typological analysis is of crucial importance, the
detailed logic of the procedure is fully explicated here on a simple fictitious
problem. Six clusters, comprising 100 individuals in all, are defined by their
scores on two dimensions, Z₁ and Z₂, as depicted in Fig. 8.3. Thus, cluster A consists of
the eight individuals in the upper left sector of the figure whose numbers
are 12, 13, 21, and so on. We can represent the eight individuals in cluster A
by the identification symbols A12, A13, and so on, to the eighth individual,
A33. Similarly, those of B would have numbers B13, B22, and so on. The
locus of each of the 100 individuals in all six clusters is determined by its
coordinates on the two dimensions. These two-dimensional scores are
listed for the first two individuals in each of the six clusters in the top two
rows of scores given in Table 8.1. For example, the Z₁ and Z₂ coordinates
of A12 are shown there to be 38 and 64, respectively.

FIGURE 8.3
Fictitious example showing 100 individual objects clustered together to
form six O-types based on their scores on the two dimensions Z₁ and Z₂.

The exact arrangements of the individuals in this space are represented
by the spatial distances between them in the two-dimensional score
space, i.e., by their euclidean distances, represented by D. Thus the value
of D for the first two individuals, A12 and A13, of cluster A is simply

$$D = \sqrt{(38 - 38)^2 + (66 - 64)^2} = 2$$

which is small, as it should be, since these points virtually sit on top of
each other in this space.
The BC TRY System includes a program called EUCO, which can compute
the entire matrix of distances among all the 100 objects depicted in
Fig. 8.3. This basic matrix constitutes the data on which the typology of the 
100 individuals is formed. The distances among the objects in a given cluster 
will approach zero, but the interspace differences between objects in differ- 
ent clusters will not be zero. The problem therefore is to devise a method 
of locating the centers of density by working with the distance matrix, to 
find the six submatrices in which the distances approach zero. 

We already have the procedures for doing so in full-cycle key-cluster 
analysis explicated in Chap. 7 on V-analysis. We can utilize this method 
provided the distance matrix is converted to a correlation matrix. The 
process of conversion is illustrated in Table 8.1. First, for each dimension
we select a given set of hypothetical objects that span the dimension. Thus,
for dimension Z_1, we choose reference-marker objects, called OMARKs, at
equal distances (steps of one standard deviation) on the Z_1 dimension. These
are the six individuals whose Z_1 scores are 20, 30, 40, 60, 70, 80 (and whose
Z_2 scores are perforce 50 at the Z_2 mean) given in the six rows under
“On dimension Z_1” in Table 8.1. An analogous set of six OMARKs is selected
which span the Z_2 dimension. The final OMARK is at the grand origin whose
coordinates are (50,50). Now, instead of working the distances between all
objects pairwise, we can merely work the distances of each object with the
13 reference OMARKs that span the score space. Any two individuals that
have the same pattern of distances with all the other 99 objects will have
the same pattern with the 13 reference objects. This can be verified by
noting that for the first two objects in each of the six clusters their pattern
of distances with the 13 OMARKs is virtually identical, signifying that they
are together in this score space.

The final aspect of conversion is to compute the correlations between 
the columns of 13 distances of the 100 points in pairs. These correlations 
are given for the first two objects in each cluster in Sec. B of Table 8.1. 
Correlations between the two objects within each cluster are virtually 1.00, 
and as the clusters separate in space the correlations go down to .00 and 
become negative when the coordinates tend to go to mirror images. 
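
The conversion from distances to correlations can be sketched in a few lines. This is an illustrative sketch only, not the EUCO program: the function names are hypothetical, the marker positions follow the description above (mean 50, standard deviation 10), and the "mirror" point (62, 36) is an invented object placed opposite A12 through the grand origin.

```python
import math

def omarks_2d(mean=50.0, sd=10.0):
    """Thirteen reference markers: six along each axis plus the grand origin."""
    steps = [-3, -2, -1, 1, 2, 3]                      # gives 20, 30, 40, 60, 70, 80
    on_z1 = [(mean + s * sd, mean) for s in steps]
    on_z2 = [(mean, mean + s * sd) for s in steps]
    return on_z1 + on_z2 + [(mean, mean)]

def distance_profile(point, markers):
    """Column of euclidean distances from one object to every marker."""
    return [math.dist(point, m) for m in markers]

def pearson(x, y):
    """Product-moment correlation between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

markers = omarks_2d()
a12, a13 = (38, 64), (38, 66)      # the two cluster A objects used in the text
mirror = (62, 36)                  # hypothetical object opposite A12

r_within = pearson(distance_profile(a12, markers), distance_profile(a13, markers))
r_mirror = pearson(distance_profile(a12, markers), distance_profile(mirror, markers))
print(round(r_within, 3), round(r_mirror, 3))
```

Two objects that sit together give nearly identical distance columns and hence a correlation close to 1.00, while the mirror-image object gives a negative correlation, which is the behavior described for Sec. B of Table 8.1.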

The 100 by 100 distance matrix is thus transformed to a full correlation
matrix. We may then perform a full-cycle key-cluster analysis on this
matrix, with diagonal values set to 1.00. In this full sequence, the final
component called is the spherical analysis, SPAN. The actual result is
shown in Fig. 8.4, where the configuration of objects displayed on the
surface of a sphere is in almost precisely the form they took in the score
space of Fig. 8.3.

The point of all this is that, in general, the configuration of objects
within score space is spread out on the surface of spheres, where it can be
inspected for cluster structure. If there is a clear cluster structure showing
definite areas of density, i.e., O-types, they will leap to the eye as they
actually do in Fig. 8.4, where the six object clusters are immediately evident.

FIGURE 8.4

The six object clusters of 100 individuals projected onto the surface of a
sphere by EUCO analysis.

Thus, EUCO analysis is a form of pattern recognition, in which the
pattern of objects in k-dimensional space is displayed for recognition on a
spherical surface. Since it clearly shows the configuration of the objects
in their interrelationships, one can also discover the hierarchical structure
among the objects. In Fig. 8.4, for example, clusters D and E are most
highly related, E is slightly negatively related to A, and so on. The exact
hierarchical relations are given metrically in the CSA component, where,
just as in V-analysis, the interdomain, i.e., intertype, correlations are given,
and where in the oblique factor coefficients the relation of each object to
each domain, i.e., to each type, is given.


There is one special difficulty with forming a typology from the SPAN 
configuration given by EUCO analysis. If there is no clear cluster structure, 
the analyst may have difficulty in making up his mind about the object
clusters that constitute a proper typology. But if no structure does appear, 
this is a fact that he must live with. In this case, he will quite arbitrarily 
break up the configuration into sectors whose centers of density form 
patterns of scores that are meaningful to him. For analysts who have 
difficulty making up their minds, we need methods (given below) that will 
always provide a finite set of object clusters when cluster structure is not 
clear, methods which discover "natural" areas of density in those regions 
of the configuration where such densities do exist. 

Another difficulty is the expense of EUCO analysis if there are many
individuals in a study, a likely possibility. To compute a correlation matrix
between the columns of distance values with six reference points on each
of k dimensions, say, for 1,000 individuals can be a forbidding undertaking,
even with modern computers. Though shortcuts can be employed, such
as forming a composite SPAN on random samples drawn from the full
supply and increasing the number of samples to secure a converged
configuration that describes the cluster structure of the full supply (Tryon,
1966), EUCO analysis is really not a practical solution if N is large. It is,
however, practical for representing the spherical configuration of centroids
only, as found by other methods described below, or for studying the
configuration of the cluster structure among a more limited number of
individuals that may lie in a sector of the configuration in hyperspace (Chu,
1966).


Typology by iterative condensation on centroids


A practical solution relatively unaffected by the number of individuals in a
study, and one not requiring large-scale distance matrices or correlation
coefficients, is an iterative procedure starting with trial O-types and
ultimately converging on O-types at centers of density. If a clear cluster
structure does not exist, the procedure nevertheless provides an arbitrary set
of O-types. This procedure is essentially a program of pattern recognition
that discovers a pattern of clusters if such a pattern exists in the cloud of
individual points in the score space. This new method is illustrated fully
in real data in later parts of this and following chapters.

We may simply illustrate the logic of this iterative method of forming a
typology on the fictitious problem of 100 individuals organized in six
distinctive clusters, as depicted in Fig. 8.3. Programmed in the component
OTYPE of the BC TRY System, the method begins with iteration 0, called I_0,
that selects an arbitrary set of trial “core” O-types from the swarm of
points. In the two-dimensional space of the fictitious problem it computes
the coordinates of the centroids of each core O-type and then reassigns
each one of the 100 points to the centroid of that particular core O-type
with which it has its smallest euclidean distance. After all 100 points are
reassigned, each O-type is thus changed; hence in the second pass, I_1, the
coordinates of the new centroid of each O-type are computed, and all
100 points are again reassigned to the new centroids to which they are
closest. New centroids are thus formed on which, in I_2, all 100 points are
again reassigned. The process continues until the final iteration, at which
all 100 points remain unchanged in their reassignment to O-types.

During successive iterations the trial O-types wander about a bit,
moving closer to centers of density. During iteration, if two trial O-types
approach closely enough (according to an arbitrary criterion), they merge
together to form a single O-type. At any reassignment of objects, if any one
point is too far away from any centroid (according to an arbitrary criterion)
it is cast out as "unique," but it can come back and join a cluster if on any
later iteration it again falls within the criterion.
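
The loop just described (reassign to the nearest centroid, recompute centroids, merge centroids that come close, set aside points that are too far from every centroid) can be sketched as follows. This is a minimal illustration under stated assumptions, not the OTYPE component: the function names, the two threshold parameters `merge_dist` and `unique_dist`, the rule of merging two centroids at their midpoint, and the thirteen data points are all hypothetical.

```python
import math

def centroid(points):
    """Mean coordinate vector of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def assign(points, centroids, unique_dist):
    """Nearest-centroid index for each point; None marks a 'unique' point."""
    if not centroids:
        return [None] * len(points)
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(dists)), key=dists.__getitem__)
        labels.append(best if dists[best] <= unique_dist else None)
    return labels

def merge_close(centroids, merge_dist):
    """Merge the closest pair at its midpoint while it is nearer than merge_dist
    (a simplifying assumption; the text only says the criterion is arbitrary)."""
    cents = list(centroids)
    while len(cents) > 1:
        d, i, j = min((math.dist(cents[i], cents[j]), i, j)
                      for i in range(len(cents)) for j in range(i + 1, len(cents)))
        if d >= merge_dist:
            break
        midpoint = centroid([cents[i], cents[j]])
        cents = [c for k, c in enumerate(cents) if k not in (i, j)] + [midpoint]
    return cents

def condense(points, seeds, merge_dist=5.0, unique_dist=25.0, max_iter=50):
    """Iterate reassignment and centroid updates until the centroids stop moving."""
    cents = [tuple(float(v) for v in s) for s in seeds]
    for _ in range(max_iter):
        labels = assign(points, cents, unique_dist)
        groups = [[p for p, lab in zip(points, labels) if lab == k]
                  for k in range(len(cents))]
        new_cents = merge_close([centroid(g) for g in groups if g], merge_dist)
        if new_cents == cents:
            break
        cents = new_cents
    return cents, assign(points, cents, unique_dist)

# Three tight hypothetical clusters plus one far-off point.
points = [(29, 70), (31, 70), (30, 69), (30, 71),
          (69, 70), (71, 70), (70, 69), (70, 71),
          (49, 30), (51, 30), (50, 29), (50, 31),
          (95, 5)]
# Four trial starting positions; two of them fall on the same true cluster.
seeds = [(25, 75), (75, 75), (45, 25), (55, 35)]

final_centroids, final_labels = condense(points, seeds)
print(final_centroids)
print(final_labels)
```

In this toy run the two trial types that start on the same cluster merge after the first pass, and the far-off point is never allocated, which mirrors the merging and "unique" rules described above.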


The results of this process of iterative condensation on centroids are
shown in Fig. 8.5 for eight separate OTYPE runs on the fictitious problem.
Besides the different criteria mentioned above, one of the most critical
parameters is the one that provides the initial starting trial core O-types in
iteration I_0. This is the set of “partitions,” or “cutoff points,” on each
dimension that establishes the initial arbitrary "sectors" in the score space
within which the initial core O-types are located. For example, in Fig. 8.5,
top left, each dimension is dichotomized to form four sectors, hence called
“Sect-4.” Starting from the four core O-types with which the run begins,
labeled Trial 1 in the figure, the trial O-types formed on the next iteration
are the four shown enclosed in broken lines. At the end of Trial 8 the process
converges on the final four O-types shown in unbroken lines: this Sect-4 run
therefore fails to converge on the "true" clusters. In the next run, however,
shown in the next lower diagram labeled Sect-9, where core O-types are
initially formed by trichotomizing each dimension, an early iteration forms
seven O-types, but on a later iteration the process converges exactly on
the true cluster structure. The rest of the runs (including Sect-100, not
shown) converge on "truth" except Sect-49 and Sect-64, each of which,
on convergence, picks up a seventh spurious O-type from the two edges
of true O-types D and E.

A variety of other parameters affect the convergence process of the
centroid condensation method, but we cannot discuss the details of them
here.

In general, however, when there is a clear cluster structure, the
OTYPE component of BC TRY finds it. The best procedure is to make
several runs under different options [...] O-types does not appear, to
input t[...]
The converged result is the final set of O-types accepted. To show the
power of the batched run, when we batch the output O-type cards from 
the erroneous runs of Sect-4, Sect-49, and Sect-64 with the other correct 
outcomes, the single batched run on OTYPE converges exactly on the true 
cluster structure. As a final check on typological structure, the analyst will 
usually run a EUCO analysis of the data in order to compare the results 
of the OTYPE run with those given by the overall configuration in SPAN. 


Typology by other procedures 


It has been pointed out above that another method of representing the 
cluster structure of objects is to form a hierarchical chart of the resem- 
blances between them as measured by the interobject distances. In the 
OTYPE program the hierarchical order of the O-types is also routinely out- 
put as a guide to understanding the intercluster structure of the centroids. 
An illustration is given in Fig. 8.6, which gives the hierarchy of the initial
15 core O-types. The meaning and use of the knowledge conveyed by the
hierarchical graph are discussed later in the chapter.

FIGURE 8.6

Hierarchical structure of the 15 core O-types in the Holzinger problem.

Mention should also be made of the "single-bond" method of object
clustering of Sneath (Sokal and Sneath, 1963, p. 180). By this method, the
individual points in score space are hierarchically, i.e., successively, con- 
densed on the basis of smallest interobject distances. The principle is the 
same as that illustrated in Fig. 8.6, except that individual points and whole 
clusters are merged on the basis of the smallest distances between indi- 
viduals but not between centroids. In the fictitious problem of Fig. 8.3, the 
single-bond principle first forms a hierarchical chart showing that, on the 
basis of smallest distances, the six true clusters are separately formed at 
the same level of interobject distances. 

The single-bond method gives a good hierarchical chart between 
single members of the total configuration. It has the defect of uniting 
clusters into a higher-order cluster only on the basis of a single bond of 
objects on edges of clusters. The result could be that if a true cluster 
includes a long string of points, progressive condensation could follow 
the string and miss the fact that another large cluster might exist nearby. 

The method of average linkage of Sokal and Michener (Sokal and
Sneath, 1963, p. 182) joins individual points to clusters and clusters to each
other on the basis of the average distances rather than single bonds. This
method is therefore closely akin to that reassignment aspect of
the centroid condensation method in which each cluster is represented by
its centroid as defined by the average of the coordinates of all the points
that constitute it. Since the average squared distance between a given
individual and all the individuals that compose a cluster differs from the
squared distance between the individual and the centroid of the cluster only
by the scatter of the members about that centroid, the condensing operation
by the two methods should be much the same.
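
The relation between average member distance and centroid distance can be made explicit for squared euclidean distances. The identity below is a standard result rather than something taken from the text; the symbols are introduced here for illustration, with x an individual, y_1, ..., y_m the members of a cluster, and c their centroid.

```latex
\frac{1}{m}\sum_{i=1}^{m}\lVert x - y_i\rVert^{2}
  \;=\; \lVert x - c\rVert^{2} \;+\; \frac{1}{m}\sum_{i=1}^{m}\lVert y_i - c\rVert^{2},
\qquad c = \frac{1}{m}\sum_{i=1}^{m} y_i .
```

The second term on the right is the scatter of the cluster about its own centroid. For a given individual, the two criteria therefore rank candidate clusters identically whenever the clusters are equally compact, and can differ when one cluster is much more spread out than another.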


An example: intelligence, the Holzinger problem 


In this section we present an illustration of an object cluster analysis, a
typological study, of the 301 grade school children in two Chicago schools 
who took 24 pencil-and-paper tests sampling a variety of intellectual abili- 
ties. The basic data are those already used in previous chapters, the 
Holzinger study. The original scores of all children in both schools are 
reported in the original reference. The tests of intellective capability of the 
study are listed in Table 4.1. The results of an empirical key-cluster analysis 
on the variables of the study are reported in Chap. 6, with the primary 
results of the analysis being reported in Tables 6.1 to 6.3. The revised 
cluster analysis results and the results of the cluster structure analysis
of the revised clusters and their scores are reported in Chap. 7. The pri- 
mary results of these analyses are given in Tables 7.1 to 7.3 and Figs. 7.3 
and 7.4. 

For the purposes of this analysis we selected four clusters of vari- 
ables: verbal, V5, V6, V7, and V9; speed, S10, S11, S12, and S13; space, F1, 
F2, F3, and F4; and memory, M14, M15, M16, M17, and M18. The group
profiles presented in Fig. 8.1 are profiles of scores on these clusters for 
the 15 groups of children that were selected by the O-analysis procedures 
described here. In the interest of brevity and simplicity we use the symbols 
V, S, F, and M to stand for the verbal, speed, space (form), and memory 
clusters and the corresponding cluster scores. 


Persons in cluster score space 


The first step in O-analysis is to compute scores for each subject on the
four dimensions defined by the clusters V, S, F, and M. A variety of
techniques for preparing these scores have been proposed (see Harman, 1967).
In practice we have found a very high degree of correlation among the
various methods, and the cluster composite method is the method we
choose. The cluster composite method gives scores that correspond to the
cluster composites of cluster structure analysis. The score on a cluster
dimension is defined as a simple sum of the standardized scores of the
variables in the cluster, hence a simple additive composite. For example,
the score of a given subject on the verbal cluster, V, is the sum of the
standard scores for that subject on V5, V6, V7, and V9. This composite score
ordinarily is transformed into standard score form, e.g., with mean of 50
and standard deviation of 10. Other transformations are used for a variety
of purposes, but for our illustration we use means of 50 and standard
deviations of 10.

A given individual's profile on the four abilities V, S, F, and M is
defined by his four standard scores on the four clusters. It is these data
which are depicted in Fig. 8.1.
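
The cluster composite method can be sketched as below. This is an illustrative sketch, not the FACS program: the function names and the raw scores for five subjects are hypothetical, and the use of the population standard deviation (dividing by N) is an assumption, since the text does not say which form is used.

```python
import statistics

def standardize(values, mean=0.0, sd=1.0):
    """Rescale a column to the requested mean and standard deviation.
    Assumes the column is not constant (its standard deviation is nonzero)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)          # population SD: an assumption
    return [mean + sd * (v - m) / s for v in values]

def cluster_composite(raw_by_variable, mean=50.0, sd=10.0):
    """Sum the standardized variables of one cluster, then restandardize the sum."""
    z_columns = [standardize(col) for col in raw_by_variable.values()]
    sums = [sum(per_subject) for per_subject in zip(*z_columns)]
    return standardize(sums, mean, sd)

# Hypothetical raw scores of five subjects on the four verbal tests.
verbal_raw = {
    "V5": [10, 12, 9, 15, 14],
    "V6": [3, 5, 4, 8, 5],
    "V7": [20, 22, 19, 30, 24],
    "V9": [7, 9, 6, 12, 11],
}
verbal_scores = cluster_composite(verbal_raw)
print([round(s, 1) for s in verbal_scores])
```

Summing standardized rather than raw scores keeps any one test from dominating the composite merely because its raw scale is wider.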

The O-analysis of the subjects in terms of the four abilities of the
Holzinger study is based on the distances between the subjects, taken
pairwise, as defined by cluster score dimensions and the subjects' scores.
For example, Table 8.2 shows hypothetical standard scores for two
individuals A and B and the computation of the distance D between them. The
similarity between the profiles of A and B is represented by the one value,
the distance D. D is simply the square root of the sum of the squared
deviations on the four scores as shown in the computations at the bottom
of the table. For these two individuals D = 17.5. Any computer program
that can compute composite scores, standardize them, and calculate the
matrix of distances between all pairs of individuals will do the basic work
for the condensation method of O-analysis. In the BC TRY System special
programs are employed, FACS, EUCO, and OTYPE.

TABLE 8.2 CLUSTER SCORES AND DISTANCE CALCULATION

                                    V      S      F      M
Cluster scores of A                36     44     34     46
Cluster scores of B                37     48     47     35
Score difference d, A's score
  minus B's score, each cluster    -1     -4    -13     11
Squared difference d^2              1     16    169    121

$$D = \sqrt{\sum d^2} = \sqrt{307} = 17.5 \qquad \mathrm{RMS} = \frac{D}{\sqrt{K}} = 8.8$$

The euclidean distance D given by the equation in Table 8.2 is not a 
metric of similarity that is consistent across studies, since its magnitude 
depends on how many dimensions there are in a study. A useful index that 
is not a function of the number of dimensions is the square root of the 
mean square difference in cluster scores. We call this index RMS. The 
index is a simple function of the distance

$$\mathrm{RMS} = \frac{D}{\sqrt{K}}$$

where K is the number of dimensions.
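
As a check on the arithmetic of Table 8.2, the two indices can be computed directly. The function names are hypothetical; the scores are those given in the table.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared score differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rms_difference(a, b):
    """Root mean square difference: the distance divided by sqrt(K)."""
    return euclidean_distance(a, b) / math.sqrt(len(a))

scores_a = (36, 44, 34, 46)   # V, S, F, M for individual A
scores_b = (37, 48, 47, 35)   # V, S, F, M for individual B

D = euclidean_distance(scores_a, scores_b)
RMS = rms_difference(scores_a, scores_b)
print(round(D, 1), round(RMS, 1))   # 17.5 8.8
```

Because RMS divides out the number of dimensions, it can be read on the same scale as the standard scores themselves, here a little under one standard deviation of 10.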


Core O-types, O-clusters, and unique persons 
from arbitrary sectors in the score space 


A preliminary stage in clustering individuals is to cast them temporarily into 
core O-types based on arbitrary sectioning of the cluster score space. In 
general, the number of sectors is determined by how many broad cate- 
gories one chooses on each of the dimensions, how many dimensions there 
are, and how many subjects are available. In the Holzinger problem, three 
broad categories were used on each of the four dimensions, the result
being that their joint cluster score space is sectioned into 3^4 = 81 sectors.

For studies with limited numbers of subjects it is desirable to keep
the number of sectors to a small number. The number of sectors is a
simple function of the number of categories on each of the dimensions
and the number of dimensions. If there are K dimensions and C categories
on each of the dimensions, the number of sectors S is equal to

$$S = C^K$$


The broad categories may, of course, cut the scores at any place. With
three categories, convenient cutoff points are at 1 standard deviation above
and below the mean. Thus for mean of 50 and standard deviation of 10
three categories give low scores below 40, middle scores of 40 and above
to 60, and high scores of above 60. The BC TRY System provides a
component OTYPE to do all this work. Standard decisions based on these
principles are built into the program, but an analyst can opt for any other
cutoff points he wishes. If there are more than six dimensions in a problem,
it is usually desirable to use not more than the most salient six; e.g., in the
MMPI problem with seven dimensions, only the most salient four are used.
But, if the analyst wishes to retain more than six dimensions, he is advised
to perform separate O-analyses on selected sets of dimensions, each set
including the most highly correlated or rationally related dimensions, not
exceeding six.

The next step is to sort the individuals into sectors in cluster score 
space. The results are shown in Table 8.3 where, for comparative purposes, 
data from the analysis of the MMPI are also given. Under "Score Patterns," 
the first sector is LLLL, meaning a score pattern all low (L), i.e., below 40 
on the four dimensions. The column marked Incidence and Core Types 
in the Holzinger problem shows that there are only two children with such a 
generally low profile of scores. The next score pattern, LLLM, is one with a
profile of scores that are low on V, S, and F, but middle (M) with scores
from 40 to 60 on the last dimension M. Two children have this profile. All
told there are 81 patterns, the last of which is HHHH, all highs (H) across
the four scores, i.e., scores above 60 on all four dimensions. There were
three such children with this pattern.
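
The sectioning of the score space into score patterns can be sketched as follows. The cutoffs follow the text (low below 40, middle from 40 to 60, high above 60); the function names and the four example profiles are hypothetical.

```python
from collections import Counter
from itertools import product

def category(score, low_cut=40.0, high_cut=60.0):
    """L below the low cutoff, H above the high cutoff, M otherwise."""
    if score < low_cut:
        return "L"
    if score <= high_cut:
        return "M"
    return "H"

def score_pattern(profile):
    """Pattern such as 'LLLM' for a profile of cluster scores on V, S, F, M."""
    return "".join(category(s) for s in profile)

# All S = C**K = 3**4 = 81 possible patterns, listed from LLLL to HHHH.
all_patterns = ["".join(p) for p in product("LMH", repeat=4)]

# Hypothetical profiles on V, S, F, M.
profiles = [(35, 38, 30, 39), (36, 44, 34, 46), (65, 70, 61, 62), (50, 50, 50, 50)]
incidence = Counter(score_pattern(p) for p in profiles)
print(len(all_patterns), dict(incidence))
```

Counting the profiles that fall under each pattern gives the incidence column of a table such as Table 8.3.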


A computer program like OTYPE of the BC TRY System is desirable
for this work, but actually it can be quickly achieved on an old-fashioned
card sorter. One sorts all 301 cards on the first two digits of the last
dimension M, ranking them from low to high. This deck is then ranked on the
first two digits of F from low to high, then on S and finally on V. If, before
this sorting is made, one puts on the top of the deck 81 "dummy" cards,
each with standard scores corresponding with one of the 81 score patterns
shown in the list on Table 8.3, these dummies take a position in the sorted
deck which marks the beginning of each of the 81 sectors so that on a
listing of the sorted deck all the cases in the 81 sectors are nicely marked off.
Note that there are either no children or a trivial number in many of
the 81 sectors of Table 8.3. One must therefore decide on the
number of cases to be required in a sector in order for it to be designated
by the label of a core O-type. In the Holzinger problem the criterion was
arbitrarily set at five children (about 2 percent). In Table 8.3 under Core No.
the sectors that meet this criterion are numbered in serial order from 1 to
15. These are the 15 core O-types plotted in Fig. 8.1.

What is to be done with the 61 children that do not easily fit into these
15 types? We assign each of them to that core O-type with which it has its
best fit. The procedure for doing so is to compute for each of these cases
its distance with the 15 average core O-types and assign it to the one with
which it has its smallest distance. Some of the 61 children have such unique
profiles that even their smallest distance with the core O-types shows
their best fit to be poor. Some criterion must be established in order to
decide when an individual will be excluded from any O-type. A convenient
uniqueness criterion is objectively set as follows. No object may be a
member of a core O-type if the RMS of its cluster scores from those of a core
O-type is greater than 1 standard deviation.

To compute the distances of the 61 individuals from the 15 core 
O-types requires that the mean score profile on V, S, F, and M of each of 
the 15 core O-types must first be computed. These values are those of an 
"abstract" individual that defines a core O-type. The profiles of these 
15 abstract individuals are plotted as filled circles or dots in each of the 
graphs of Fig. 8.1. In the BC TRY System the component program OTYPE 
computed the distances of the 61 children with the 15 abstract types, 
routinely assigned each child to the O-type with which it had the smallest 
distance, and set aside the children who did not meet the uniqueness 
criterion. The OTYPE program permits the analyst to choose other criteria
if he wishes. 
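
The selection of core O-types and the allocation of the remaining individuals can be sketched as below. This is an illustration only, not the OTYPE component: the function names are hypothetical, the six profiles are invented, and the demonstration uses a minimum count of 2 rather than the 5 used in the Holzinger problem so that the example stays small. The uniqueness limit of 10 corresponds to 1 standard deviation on a scale with standard deviation 10.

```python
import math
from collections import defaultdict

def core_otypes(profiles, patterns, min_count=5):
    """Sectors holding at least min_count individuals become core O-types."""
    groups = defaultdict(list)
    for profile, pattern in zip(profiles, patterns):
        groups[pattern].append(profile)
    return {pat: members for pat, members in groups.items() if len(members) >= min_count}

def mean_profile(members):
    """Profile of the 'abstract' individual that defines a core O-type."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

def rms(a, b):
    """Root mean square difference between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def allocate(profile, centroids, max_rms=10.0):
    """Label of the nearest core O-type, or None if even the best fit is too poor."""
    label = min(centroids, key=lambda k: rms(profile, centroids[k]))
    return label if rms(profile, centroids[label]) <= max_rms else None

# Hypothetical profiles on V, S, F, M with their score patterns.
profiles = [(35, 35, 35, 35), (37, 37, 37, 37), (65, 65, 65, 65),
            (63, 63, 63, 63), (38, 38, 38, 44), (30, 70, 30, 70)]
patterns = ["LLLL", "LLLL", "HHHH", "HHHH", "LLLM", "LHLH"]

cores = core_otypes(profiles, patterns, min_count=2)
centroids = {pat: mean_profile(members) for pat, members in cores.items()}
near = allocate((38, 38, 38, 44), centroids)     # close to the LLLL core type
unique = allocate((30, 70, 30, 70), centroids)   # far from both core types
print(sorted(cores), near, unique)
```

The individual with pattern LLLM joins the LLLL core type because its RMS from that type's mean profile is well under the limit, whereas the LHLH individual is set aside as unique.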

The final results of these procedures are all graphically shown in
Fig. 8.1. Of the 61 unallocated children, 45 are assigned to the 15 core
O-types on the basis of small distances. Their scores are represented in
the graph by r's. The uniqueness criterion excludes the 16 unique children
shown in the bottom graphs.

[...] determine the arbitrary sectors of score space and the assigning of
individuals to them may be all that the analyst wishes.
Hierarchical structure of core O-types


If there are no a priori grounds for accepting the established sectors, the 
above procedures may produce more arbitrariness than one wishes to 
tolerate, for there may be natural clusters that have been sliced through 
by the arbitrarily defined sectors. Furthermore, some individual members 
of a core O-type may lie in corners of arbitrary sectors so that in fact they 
may be closer to core O-types in adjacent sectors than to the objects in the 
sectors to which they have first been allocated. Therefore additional pro- 
cedures are needed to free the analysis of such arbitrariness. 

What is needed is a display of the overall structure of the relation- 
ships among the 15 core O-types. If the slices cut through a tight natural 
cluster, the core O-types into which it is fractionated should have similar 
profiles, with small distances between them. If the core O-types are pro- 
gressively combined, two at a time, into higher-order clusters, those which 
constitute a natural cluster would be the first to combine. Progressive 
hierarchical condensation of the 15 core O-types of the Holzinger problem 
gives the chart shown in Fig. 8.6. 
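
Progressive hierarchical condensation of centroids can be sketched as follows. This is a minimal illustration, not the OTYPE output routine: the function name is hypothetical, the four starting profiles and their sizes are invented, and representing a merged type by the size-weighted mean of its two parents is an assumption. New higher-order types are numbered serially after the core types, as type 16 follows the 15 core types in the text.

```python
import math

def condense_hierarchy(centroids, sizes):
    """Repeatedly merge the two closest types; return (a, b, distance, new_label).
    Labels are assumed to be integers so that new types can be numbered serially."""
    cents, n = dict(centroids), dict(sizes)
    next_label = max(cents) + 1
    history = []
    while len(cents) > 1:
        d, a, b = min((math.dist(cents[a], cents[b]), a, b)
                      for a in cents for b in cents if a < b)
        total = n[a] + n[b]
        # Size-weighted mean of the two parent profiles (an assumption).
        merged = tuple((n[a] * x + n[b] * y) / total
                       for x, y in zip(cents[a], cents[b]))
        for k in (a, b):
            del cents[k], n[k]
        cents[next_label], n[next_label] = merged, total
        history.append((a, b, round(d, 2), next_label))
        next_label += 1
    return history

# Hypothetical core O-type profiles on V, S, F, M and their frequencies.
core_centroids = {1: (35, 50, 35, 50), 2: (50, 36, 40, 50),
                  3: (50, 36, 50, 50), 4: (65, 65, 60, 55)}
core_sizes = {1: 8, 2: 6, 3: 6, 4: 10}

history = condense_hierarchy(core_centroids, core_sizes)
for step in history:
    print(step)
```

Reading the list of merges from first to last corresponds to reading a hierarchical chart from the top down: the pair with the smallest distance combines first, and the distance at each merge is the height at which the two branches join.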

In this chart the vertical scale is the euclidean distance. The first two
core O-types with the smallest distance between them are shown at the
upper left, namely, types 4 and 6, whose distance is a little below 11. The
profiles of type 4 in Fig. 8.1 consist of individuals distinctively low on speed
and form, hence the symbols SF are placed above type 4 in the hierarchical
chart. Type 6 consists of individuals distinctively low on S alone. This new
higher-order type consists of individuals whose profiles are low on S but
vary on F from middle down to low. By such a combination we lose the two
core types 4 and 6 but gain type 16 in their stead.

The progressive condensation of core O-types is shown as one reads
down the chart. On the left of the chart, headed at the top by Lows, these
condensations always lead to higher-order types that are low on at least
one of the four dimensions: at the bottom opposite a D of 22, the grand
higher-order type 26 consists of all individuals that meet this condition of
varying from the mid-value of 50 to a very low value on one or more of the
dimensions. On the right of the chart are the Highs whose final higher-order
type 28 at the bottom right of the chart consists of all individuals that
vary in the higher ranges of the cluster on at least one dimension.

Reading the chart from bottom up is like observing a breakdown of
the subjects into a taxonomic structure varying, as it were, from genus to
species to varieties, all quite objectively and quantitatively determined.
Faced by such a chart, one must decide at what level of similarity, as
measured by D, one wishes to declare his final typology to be. One decision is
not to accept any higher-order types but to retain all 15 core O-types as
depicting the final typological structure. But if it is desirable to deal with
less than 15, one could reject types 4 and 6, and accept in their stead type
16, etc.

The cutoff point of similarity depends upon the purpose of the 
typology. If one wants, for example, a grand higher-order type in which all 
children are at least low in one or more ability and below average in the 
rest, one would accept type 26 (excluding type 9 children who are in the 
middle ranges in all abilities). Analogously, all individuals in the grand 
higher-order type 28 are high in one or more of the abilities. For certain
purposes an analyst may wish to deal with two such grand high vs. low
person-clusters.
[...] than in the hierarchy of Fig. 8.6 or in the multiple profiles of Fig. 8.1.
The standard score scales of the verbal, speed, form, and memory dimensions
are set into the configuration so that the standard scores of the various
types can be read from the configuration.

In this display a spherical triangle is plotted by long-dashed lines, the
three apexes of which are designated by squares. These "corners" of the
triangle are 90° apart as seen from the origin of the sphere. Using these
corners as reference points, it can be seen that any two types that are more
than 90° apart will have profiles that are mirror images. For example, type 5
and type 13 generally have standard scores departing in opposite directions
from the mean of 50 and presenting mirror images in the patterns of scores
across the four abilities.

This configuration of 15 core O-types actually lies on the surface of a
hypersphere, i.e., on a sphere of more than three dimensions. However,
the procedures for developing the configuration are such as to project
as many types as possible on the surface of one three-dimensional sphere,
leaving a minimal number to be projected into a fourth dimension, or more.
How successful the procedure is may be seen from the fact that there is
only one O-type, 11, lying in the fourth dimension, denoted by an r in the
figure. This fourth dimension is mainly concerned with form, for the form
axis lies in the fourth dimension only at its ends, as is the case with core
type 11, which has a high standard score on F.


Final types from the profiles, the hierarchy,
and the spherical configuration


Information from Figs. 8.1, 8.2, and 8.6 is used in arriving at the final types 
to retain from an O-analysis. In the Holzinger problem it was finally decided 
not to deal with any higher-order combinations of the 15 core O-types but 
to leave them as they stand. The reason is that it did not appear as if any 
particular cutting through of natural clusters had occurred. The 15 types 
are listed in Table 8.4, being denoted as the final type by the prefix “T.”
Thus, in the first column they are symbolized as T1, T2, down to T15. A
descriptive term is given to each one based upon its distinctive high and low
mean standard scores. The profile level of each O-cluster is given by its
four standard scores, listed in the columns headed by Z under Profile
Level and Homogeneities. The frequency of cases in each O-type is listed
in the column headed Initial. For example, in T1, described as low verbal,
low form, is a group of ten children consisting of the original eight in core
O-type 1 plus two originally unallocated subjects that are assigned to it
later on the basis of their smallest euclidean distances. The standard
scores of this type are shown in its row of Z values under Profile Level
and Homogeneities.

Having made the decision on the final types, we now assign all 301 
subjects to those O-types with which they have the smallest distance. 
Although we assign to O-types individuals that have not been original mem- 
bers of the core O-types, there may remain some members of an original 
core type that may be closer to core types in adjacent sectors. When all 
301 individuals are assigned to those core types with which they have the 
smallest distance, the frequency in the final types may change from that 
in initial O-clusters. For example, T1 has 14 individuals in it after adjust- 
ment, a gain of 4 individuals from adjacent sectors. Type T9 shows the 
greatest change. There were 83 subjects originally in this large average 
core O-type. Obviously, many subjects from outlying edges of it have 
joined up with core types in adjacent sectors. This process of assignment 
and reassignment could be reiterated, but the problems involved in this 
reiterative procedure must be left for later development. 
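
The assignment step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score and type-mean arrays are hypothetical values on the 50/10 standard scale, the function name is invented here, and nothing below reproduces the BC TRY programs. It performs a single assignment pass and does not attempt the reiteration that the text leaves for later development.

```python
import numpy as np

def assign_to_nearest_type(scores, type_means):
    """Assign each subject to the type with which it has the smallest D."""
    # diffs[i, k, :] is subject i's profile minus the mean profile of type k
    diffs = scores[:, None, :] - type_means[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))  # euclidean D, shape (subjects, types)
    return distances.argmin(axis=1), distances

# Hypothetical standard scores (mean 50, SD 10) of four subjects on four dimensions
scores = np.array([
    [38.0, 52.0, 49.0, 37.0],
    [41.0, 50.0, 51.0, 40.0],
    [62.0, 48.0, 50.0, 51.0],
    [50.0, 51.0, 49.0, 50.0],
])
# Hypothetical mean profiles of three types
type_means = np.array([
    [40.0, 50.0, 50.0, 38.0],  # a "low" type
    [60.0, 50.0, 50.0, 50.0],  # a "high" type
    [50.0, 50.0, 50.0, 50.0],  # an average type
])

labels, distances = assign_to_nearest_type(scores, type_means)
frequencies = np.bincount(labels, minlength=len(type_means))
print(labels, frequencies)
```

Counting the labels after such a pass is what produces the changed frequencies in the final types, since peripheral members of a large type may lie closer to a neighboring type.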


Multivariate selection in the final types 


Each of the 15 types is a multivariate selection of individuals from the full 
supply of 301 children, because it is based on four different abilities, each 
of which samples several tests. The selection has occurred in two ways. 
First, the profile levels of mean standard cluster scores reveal how dis- 
tinctively the selected groups depart from the average value of 50 in the 
full supply. A further index of selection is the homogeneity coefficient H
for the selected group. This H value is a measure of the “tightness” of the
profiles of individuals that compose a given type. The H value is a function
of the within-variance of the group's cluster scores compared to the total
variance of individuals in the full supply. In the case of type T1, its H value
is formulated

$$H = \sqrt{1 - \frac{\text{variance of cluster scores of the 10 children in T1}}{\text{variance of cluster scores of all 301 children}}}$$


When the members of an O-type are identical in their profiles, the within- 
group variance is zero and the homogeneity H is unity. If, however, the 
O-type is a random selection from the full supply, within-group variance 
equals the variance in the full supply and H becomes zero. 

Type T1 consists of 10 individuals whose cluster scores on V are close 
to being identical because its H is .93. It is illuminating to look down the 
column of H values for each of the abilities to see how each of the types
has been selected on that particular ability. For a given O-type, the row
of H values across the different abilities indicates how homogeneous the 
multivariate selected group is on the four different abilities. An overall 
index of homogeneity is given in the last column of Table 8.4; it is merely 
the average of the H values across the four abilities. The greatest overall 
selection has occurred in types T9 and the highly verbal type T14. The
least selective is T13, with a value of .79, a type with relatively low selection 
on V and F. 

In the bottom row of Table 8.4 are the unique individuals. Their 
cluster scores reveal that the unique children, as a group, tend to run 
somewhat above the average in the four abilities. Their homogeneity coef- 
ficients are negative. A negative homogeneity coefficient means that the 
individuals are more heterogeneous (greater variance of scores) than the 
full supply, a point to be expected with the full array of unique individuals. 
The square root in the homogeneity equation does not permit evaluation 
of H when the value under the radical is negative. In practice the square
root of the absolute value is found, and the sign of the value is attached
to the square root.
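
A minimal sketch of the homogeneity coefficient, including the signed square root just described, may make the behavior at the limits concrete. It assumes that H is the square root of one minus the ratio of within-group variance to full-supply variance, and that population variances are used; the function name and the sample values are hypothetical.

```python
import numpy as np

def homogeneity(group_scores, all_scores):
    """Signed homogeneity H of a selected group on one dimension."""
    ratio = np.var(group_scores) / np.var(all_scores)  # within variance / total variance
    value = 1.0 - ratio                                # quantity under the radical
    # Take the root of the absolute value and reattach the sign
    return float(np.sign(value) * np.sqrt(abs(value)))

full_supply = [30.0, 40.0, 50.0, 60.0, 70.0]  # hypothetical full supply, variance 200

print(homogeneity([50.0, 50.0, 50.0], full_supply))  # identical profiles
print(homogeneity(full_supply, full_supply))         # same variance as the full supply
print(homogeneity([20.0, 80.0], full_supply))        # more heterogeneous than the supply
```

The third call illustrates the case of the unique individuals: a group more variable than the full supply yields a negative value under the radical and hence a negative H.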


It is desirable to estimate the degree to which each of these O-types 
has profile levels and homogeneities that could be values arising by pure 
chance selection from the full supply. This matter is treated in a later 
chapter in a description of the BC TRY component 4CAST, which estimates 
these probabilities. 

The final typology indicates that there are disjunctive, incompatible 
patterns of scores on the abilities. No compensatory low-high opposite 
extremes of abilities exist. Furthermore, some patterns of extremes in the
same direction are also absent. Whereas in types T1 and T2, low verbal
occurs with low form and with low memory, respectively, low verbal does
not occur with low speed. At the other extreme we find, conversely, that
in type T15, high verbal occurs in 12 children with high speed but not with
high form or with high memory. Just why there should be reciprocal
incompatibilities is a fascinating question but beyond the scope of this book.


An example: personality, the MMPI problem 


Four general attributes, I, B, S, T, discovered
by V-analysis of the MMPI items


The clinical scales that sample the item pool of 566 items of the MMPI
suffer from the defect of including items that overlap between scales.
The result is that a considerable but unknown amount of redundancy
occurs in the different scales. Tryon and two of his associates, Kenneth
Stein and Chen-Lin Chu, made what appears to be the first comprehensive
effort to develop item-cluster scales that involve no overlap of items. This
newer approach was not possible before the development of the convergence
method, by which, under the BIGNV procedures of BC TRY, V-analysis
was released from restriction by the number of variables in a problem
(Tryon, 1966). The importance of this new development is that subjects
can be scored on experimentally independent item pools and a serious
effort made to form a taxonomy of person-clusters. This chapter presents
such a typology, performed by the new condensation method of O-analysis.

In our own V-analysis of the 566 MMPI items we found seven dimensions
defined by psychologically meaningful oblique item-clusters. Four of
these seven dimensions accounted for most of the total communality of all
the items of the MMPI. About two-thirds of the items were eliminated
because of trivial communalities. These four item-clusters are briefly
summarized in Sec. A of Table 8.5. Only the item numbers are indicated in these
four scales, because the MMPI blank is readily available to readers who
wish to look up the individual items.

The example developed here presents a complete typology of individuals,
based on their profiles on these four item-cluster scales, identified
as introversion (I), body symptoms (B), suspicion (S), and tension (T).
Since the MMPI is used in connection with psychiatric examinations, a
group expressly selected to be heterogeneous with respect to mental illness
was used in this study. The supply of subjects consisted of 70 psychotics
(schizophrenics), all with a history of hospitalization within the previous
5 years, 150 diagnosed as “anxieties” but with no history of hospitalization,
and 90 normal subjects who were Armed Services officers matched with
the 220 psychiatric patients with respect to age and education. In Sec. B
of Table 8.5 the cluster scores on the four attributes are shown to correlate
higher than the four intellectual abilities in the Holzinger problem.
Indeed, the correlation of the tension cluster with the body symptoms
cluster is a high level of .75. The reliability coefficients, in Sec. C, run a bit
higher, generally in the .90s, compared to those of intellectual abilities, and
the generalities of the four MMPI attributes, in Sec. D, are higher than
those of the intellectual abilities clusters.


Persons in the cluster score space of I, B, S, T


After the cluster scores of each of the 310 subjects are computed on each
of the four attributes, the 310 individuals are represented as points in the
cluster score space of four dimensions, and the similarity between score
patterns of any two individuals is represented by the euclidean distance
D between them.


TABLE 8.5 FOUR BASIC ITEM-CLUSTER
ATTRIBUTES, I, B, S, T, OF THE MMPI

A. Defining Variables of I, B, S, T

I: Introversion, defined by the following 26 items:
 377   267    52   317  -415
 -57   172  -309  -264  -482
 321    86  -479   138
 201   171   509  -353
 180  -547   292   304
 -31  -521   -79  -449

B: Body symptoms, defined by the following 33 items:
-243  -230   125    72  -160   -18
 189  [illegible]   -68  [illegible]   191  -192
 108    47    10   -36  -153    14
-190    44    23  -163   263
  62   -55   161   -51  -330
 -75    29   544  -103  [illegible]

S: Suspicion and mistrust, defined by the following 25 items:
 404   244   447   284   455
 507   348   319   438
 383   368    71    89
 390   280   558   112
 436   265   406   426
 136   469   278   316

T: Tension, worry, and fears, defined by the following 36 items:
 555   543   448   182   158    22
 431   442   186    32   303   351
 337    43   499   439    13  [illegible]
 217  -242   166   335   388   365
 238   340   338   102   322   494
 506  -152  -407   473   360   492

B. Intercorrelations between Cluster Scores

         I      B      S      T
I              .47    .27    .68
B       .47           .32    .75
S       .27    .32           .48
T       .68    .75    .48

C. Reliability Coefficients of Cluster Scores

                   I      B             S             T
α reliability     .93    [illegible]   [illegible]   [illegible]

D. Generality across All Seven Item-clusters,
Each Defined by Its Best 17 Items

                     I      B      S             T
Mean square of r's  .65    .55    [illegible]   [illegible]


Core O-types, O-clusters, and unique persons
from arbitrary sectors of score space


Each of the scales is sectioned into the same three broad categories as in 
the Holzinger problem, namely, into low (L), middle (M), and high (H). The 
result, presented earlier in Table 8.3, is that the same 81 sectors are made 
in the cluster score space. Since the total number of subjects in the two 
problems is about the same, it is easily seen that there are relatively more 
subjects with extreme profiles in the MMPI than in the Holzinger problem. 
There are, for example, five times as many cases with MMPI patterns of
LLLL and LLML as we find in the Holzinger problem. At the other
extreme, for the patterns HHHH and HHMH the ratio is 17:3. 

When we set a minimal criterion of five subjects in a core type, 21 
core types emerge. The profiles of the subjects in these core types are 
displayed in Fig. 8.7. There are 41 subjects in sectors that have fewer than
five cases; when each of these subjects is assigned to the particular core
type with which it has its smallest D, every one of them becomes a member
of a core type. The scores of the assignees are plotted as zs in Fig. 8.7. The
absence of unique individuals in this MMPI problem, compared with the
presence of 16 of them in the Holzinger study, is not due to there being
more core types to join. When the 21 core types are condensed to a final
set of 14, there are still no subjects who are unique when the same criterion
of uniqueness is used.
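
The sectioning of score space and the minimal criterion for a core type can be sketched as follows. The low cutoff of 40 is the one mentioned in the text; the high cutoff of 60, the treatment of scores falling exactly on a cutoff, and the sample profiles are assumptions made only for illustration.

```python
from collections import Counter

LOW_CUT = 40.0   # low-score cutoff cited in the text
HIGH_CUT = 60.0  # assumed symmetric high cutoff (not stated in this section)

def sector(profile):
    """Label a four-score profile with its L/M/H pattern, e.g. 'LLML'."""
    return "".join(
        "L" if z < LOW_CUT else "H" if z > HIGH_CUT else "M" for z in profile
    )

def core_sectors(profiles, min_cases=5):
    """Return the sectors that hold at least min_cases subjects."""
    counts = Counter(sector(p) for p in profiles)
    return {name: n for name, n in counts.items() if n >= min_cases}

# Hypothetical profiles: five low, six average, two high-but-sparse
profiles = (
    [(35, 38, 36, 39)] * 5
    + [(50, 52, 48, 51)] * 6
    + [(65, 62, 50, 70)] * 2
)

n_sectors = 3 ** 4  # three categories on each of four dimensions
core = core_sectors(profiles)
print(n_sectors, core)
```

Subjects in sectors that fail the criterion, such as the two HHMH cases here, are the ones later assigned to the core type with which they have the smallest D.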


Hierarchical structure of the 21 core O-types 


The hierarchical structure of the 21 core O-types is charted in Fig. 8.8. A 
dashed line has been drawn across the chart at the similarity level of 
O-types having a distance of about 11, the smallest distance between any 
of the 15 core O-types in the Holzinger problem. All the MMPI core types 
above the line are therefore more similar than any of those in the
intelligence problem.

The first question to decide is whether the arbitrary sectioning of the
cluster score space has cut directly through natural clusters in MMPI score
space. Consider the three O-types 2, 3, and 4 shown to have small D values
at the top left of the chart. The profiles of these clusters in Fig. 8.7 indicate
that the low-score cutoff of 40 has in fact passed directly through the three
sets, and their profiles appear not to be critically distinguishable from each
other. This is why their D values are so low. Combined, they form the
higher-order type 24. This situation is about the same for types 1 and 6 that
together form higher-order type 23. Generally, the following additional
higher-order combinations appear reasonable: 7+10+8, 12+15, and
18+19. O-type 5 does not, however, seem to be justifiably combinable
with type 11.


FIGURE 8.7
Profiles of 310 adults in 21 core O-types in the MMPI problem.


A striking similarity with the intellectual typological structure appears
in the gross differentiation between the lows on the left, where all the
types have low cluster scores in at least one attribute and none of them
high, and the highs at the right, all of which have high cluster scores in
at least one attribute and none of them low.
Recall that the symbol above each of the types refers to that attribute with
respect to which it is a high or low extreme. For example, the higher-order
type 24 at upper left consists of persons all low in I, introversion. From this
hierarchical chart one can quickly read off the distinctive attributes of any
of the core types or of higher-order combinations of them. If, for example,
one wishes to isolate only persons who are low in at least one attribute and
not high in any of the others, then higher-order type 37 is the target group.
Conversely, those high in at least one attribute and not low in any other
would be a combination of types 35, 36, 13, and 17.


FIGURE 8.8
Hierarchical structure of the 21 core O-types in the MMPI problem.


Spherical configuration of the 21 core O-types


The total configuration of the core types is displayed on the surface of a
sphere in Fig. 8.9. The lows are depicted at the left, highs at the right.
Higher-order combinations linked in the hierarchical chart are shown
in dashed lines. The symbol = signifies the fourth dimension, which would
cover high cluster scores in the body cluster but otherwise average, as is
the case with core type 14, or low scores in the suspicion cluster but
otherwise average, as is the case with core type 9.


Final types from the profiles, the hierarchy, 
and the spherical configuration 


From a study of Figs. 8.7 to 8.9 those core types which seemed originally
to have been produced by arbitrary sectioning of the cluster score space
are condensed. From these condensations a final set of 14 types emerged.
These types are listed in Table 8.6, where they are symbolized T1 to T14.
They are also plotted on the sphere of Fig. 8.9 at their approximate loci.
Those formed by a condensation of O-clusters are clearly evident from the
notation under the column headed O-cluster Origin. T1, for example, is
such a condensation, being a combination of O-clusters 1 and 6. Others,
such as T4, are not a combination, remaining as an O-cluster but being
given a new, final type number.


FIGURE 8.9
Spherical structure … MMPI problem.

The profile levels of the 14 types are shown in Table 8.6 under the 
columns headed Profile Level and Homogeneities, these being the rows 
of cluster score means printed there. The descriptive names given the 
14 types are written in the O-cluster Origin column. Type T1, for example, 
is called extrovert, healthy, trusting, relaxed, these words being based on 
the attributes in which the type is 1 or more standard deviations from the 
mean of 50. 

The number of cases in each type is shown in the columns with the 
overall heading Frequency of Cases. For example, there are 68 persons in 
the initial O-cluster that formed type T6. This number shrinks to 24 when, 
as the last condensation step in our procedure, we reassign all the 310 
persons to the final 14 types with which they have their smallest distances. 
The reason for this shrinkage is that some of the persons in the periphery 
of the large initial core O-type of average individuals have, in the reassign- 
ment, joined other adjacent types with which they have smaller distances 
than they do with T6. 

The homogeneities of the final types after the reassignment are
shown in their columns of H values. For example, their overall H values
are in most cases on the order of .90, signifying that the final types are
very tight person-clusters. The dimension on which each type is most
homogeneous is the one with highest H value among the four values of H
given in its row under I, B, S, and T.

Some insight into the validity of this typology is provided in the last
three columns of Table 8.6. Recall that the supply of 310 persons consisted
of normals, anxieties, and schizophrenics. The typology is itself formed
without any direct reference to these three classes of persons. Nevertheless,
the 90 Armed Services officers are heavily concentrated in types T1,
T2, and T3, and, surprisingly, in T8, the suspicious. The distributions of the
two classes of psychiatric patients are quite different from those of the
normal subjects. When we compare the two patient groups inter se, we
find that the anxiety patients appear to be more concentrated in T4, the
trusting, and T9, the somatic, whereas the schizophrenics appear relatively
more numerous in T8, the suspicious, and T11, the introvert. Interestingly,
there also appears to be an excess of schizophrenics in another type, T5,
the extrovert. Full confirmation of these differences, however, would
require many more cases than we have in this study.


Multivariate selection


The relatively high frequency of some extreme patterns of cluster scores 
and the absence of other equally probable patterns reveal how the 14 types 
are differentially selected in the four MMPI attributes. The most common
types with extreme low scores are T2, the extrovert, healthy, relaxed, and
T4, the trusting. In the high extremes, the most common are T11, the
introverts, T8, the suspicious, and T7, the somatic, tense. Just as we found
with the intellectual patterns, there are no types of individuals that have
both high and low extreme scores in any two of the MMPI dimensions.
Such disjunctive patterns would not be expected in view of the generally
positive correlations among the four attributes; the values of correlation
coefficients do not, however, provide precise statements about multi- 
variate patterns of scores across attributes. 

With regard to conjunctive patterns, unlike the intelligence typology, 
we do find among the 14 MMPI types an extreme type that has high scores 
in all four attributes, namely, T14, and one with low scores in all four, 
namely, T1. A particularly potent conjunctive pattern is the combination 
of extreme low scores in both the body and the tension clusters, namely, 
the healthy, relaxed person. In Table 8.6 this conjunction appears with 
relatively high frequency in three types, T1, T2, and T3, inhabited largely
by Armed Services officers.


One intriguing and baffling type of selection is the discovery that
some of the types reveal extremely high homogeneities in some of the
attributes but low homogeneities in others. For example, types T10 and
T11 both have tight profiles in the body and the tension clusters, their
H's in these two attributes being of the order .90, whereas they are
simultaneously heterogeneous in the suspicion cluster, with homogeneities of
the order .50. This lack of tightness in the suspicion cluster is not a general
typological feature, for some of the other types have homogeneities in
the suspicion cluster of the order .90. To explain such complex selections
in the different types is a challenging research problem.


Screening method of fitting a
new person to the typology


The discovery of where a new person fits into this typology is achieved by a
quick screening procedure. His responses are coded by keys on the four
item-clusters, raw total scores are summed on the four attributes, and
these are converted to standard scores on a scale with a mean of 50 and
standard deviation of 10 (statistical constants for this conversion are given
below). Typing the subject is then simply a matter of discovering with which
one of the 14 MMPI types he has his lowest euclidean distance D. This
takes only a few minutes by using the screening chart shown in Fig. 8.10.
The four attributes are represented by four vertical standard score scales
in the figure. On each of the scales the score value of each of the 14 types
is represented by its type-number at the right of the scale. The tester
writes in the subject's standard scores on this chart, as illustrated by the
values in parentheses for three sample subjects in Fig. 8.10. The simple
calculations of a subject's best fit to a type are illustrated for the three
sample subjects in the lower part of Fig. 8.10b. Since an individual is not
to be identified with any type unless his scores lie within 1 standard
deviation (10 score units) of those of one of the types, one can quickly locate
that particular scale on which the fewest types lie
within 10 score units of the subject's plotted score. It is then merely a
matter of discovering which of these types is the one with which he has
four deviations of not more than 10 units on the four scales. When this
type is located, the actual values of the four deviations are written down
and the mean of their absolute values is computed. This mean deviation should
not exceed 10. If it does, the subject is unique. If there should be more
than one type that yields a mean deviation below 10, the subject is assigned
to that type with the lowest value. Since the D of a subject is the square
root of the sum of the squares of the deviations, the type with which he
has the smallest mean absolute deviation will also be the type with which
he has his smallest D. If there is any question about the matter one can
compute the actual value of D.


FIGURE 8.10
Finding the MMPI type with which a subject has his best fit.
a. Dimension scales showing the 14 types and three subjects' …
b. … of a person to his most similar type.
1. Plot his profile. Then select the dimension scale on which he has his
fewest d<10. For each type there with a d<10, count the number of scales with
which the given type has d<10 with him.
[counts for Cases 120, 221, and 349 illegible]
2. Read off from the scales the d values of the types with which he has four
values of d<10, and compute their mean |d|, thus:

Subjects: No. 120, No. 221, No. 349

           With T1   With T4       With T8       With T13
I            +2        -6            -8             0
B            +6        -3            -3            +6
S            -5      [illegible]   [illegible]   [illegible]
T            +3         5          [illegible]   [illegible]
Σ|d|         16        23            24            16
Mean |d|     4.0       5.8           6.0           4.0

The conversion of raw scores to standard scores requires the means
and sigmas of the norm group of 310 subjects of this study. The constants
are as follows:

                            Reliability Coefficients
      Mean          σ         α      Split half
I     10.7         7.4       .93       .93
B     10.2         7.8       .92       .93
S     [illegible]  5.6       .85       .89
T     14.1         8.5       .92       .92

To illustrate, the Z score of a subject with a raw score of 15 on I is

$$Z = \frac{15 - 10.7}{7.4} \times 10 + 50 = 5.8 + 50 = 55.8 \approx 56$$
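
The conversion and the screening rule can be sketched together in Python. The mean and sigma for I are the constants given above; the two type profiles, the subject's scores, and the function names are hypothetical, since Table 8.6 is not reproduced here. The sketch reports D alongside the mean absolute deviation, because the two criteria need not always pick the same type and D is the check the text recommends in doubtful cases.

```python
import math

ATTRIBUTES = ("I", "B", "S", "T")

def to_standard(raw, mean, sigma):
    """Convert a raw score to the standard scale (mean 50, SD 10)."""
    return 50.0 + 10.0 * (raw - mean) / sigma

def screen(subject, types, limit=10.0):
    """Return (type name, mean |d|, D) of the best-fitting type, or None if unique."""
    best = None
    for name, profile in types.items():
        devs = [subject[a] - profile[a] for a in ATTRIBUTES]
        if any(abs(d) > limit for d in devs):
            continue  # a deviation of more than 10 units rules this type out
        mean_abs = sum(abs(d) for d in devs) / len(devs)
        dist = math.sqrt(sum(d * d for d in devs))
        if best is None or mean_abs < best[1]:
            best = (name, mean_abs, dist)
    return best

# Worked conversion from the text: raw score of 15 on I
z_i = to_standard(15, mean=10.7, sigma=7.4)

# Hypothetical type profiles and subject (standard scores)
types = {
    "type_a": {"I": 38, "B": 36, "S": 41, "T": 37},
    "type_b": {"I": 62, "B": 60, "S": 55, "T": 64},
}
subject = {"I": 40, "B": 42, "S": 36, "T": 40}

print(round(z_i), screen(subject, types))
```

The hypothetical subject deviates from type_a by +2, +6, -5, and +3, the same pattern as the first worked column of Fig. 8.10, giving a mean absolute deviation of 4.0.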


Problems in the scoring of individuals for typological analysis


In the foregoing treatment we dealt only with the special case in which the 
observations on individuals are equally weighted cluster (subset) scores 
from complete data on subjects. We now consider how, for typological pur- 
poses, to score individuals on factors instead of clusters, how to handle 
incomplete (missing) data, how to weight cluster-defined dimensions 
differentially, and how to correct a typology for certain subgroup biases 
by means of subgroup standardization of scores. We discuss how these
problems are dealt with by programs of the BC TRY System. 


Differential weighting of variables in determining cluster or factor scores 


The weight matrix 


Scores of an individual on any dimension can always be conceptualized
as the weighted sum of his standard scores on all the n variables of a study.
To make this point clear, the general scoring program of the BC TRY System,
FACS, that computes dimension scores of individuals, prints out at
the beginning of its output the “nominal weight matrix.” An example is
given in Table 8.7, the FACS weight matrix in the Holzinger abilities problem
for the calculation of five rational composite scores of children on F (form
or space), V (verbal), S (speed), M (memory), and N (number or mathematics).
Note, for instance, the first column of numbers gives the weights
of all 24 tests required to form a composite score on the V (verbal) dimension,
these being the value 1.00 for each of the five verbal tests and .00 for
all the other tests. These values mean that individual differences in the
five verbal tests are being fully utilized in the verbal composite but that
individual differences in the remaining 19 tests are not represented at all
in the weight pattern.


TABLE 8.7 ILLUSTRATION OF NOMINAL WEIGHTS FOR SIMPLE SUM
RATIONAL COMPOSITE SCORES IN THE HOLZINGER PROBLEM

Variable                Zv     Zs     ZF     ZM     ZN
 1 F1  Vis             .00    .00   1.00    .00    .00
 2 F2  Cub             .00    .00   1.00    .00    .00
 3 F3  Fbd             .00    .00   1.00    .00    .00
 4 F4  Loz             .00    .00   1.00    .00    .00
 5 V5  Inf            1.00    .00    .00    .00    .00
 6 V6  Cmp            1.00    .00    .00    .00    .00
 7 V7  Snt            1.00    .00    .00    .00    .00
 8 V8  Wcl            1.00    .00    .00    .00    .00
 9 V9  Wmn            1.00    .00    .00    .00    .00
10 S10 Add             .00   1.00    .00    .00    .00
11 S11 Cod             .00   1.00    .00    .00    .00
12 S12 Cnt             .00   1.00    .00    .00    .00
13 S13 [illegible]     .00   1.00    .00    .00    .00
14 M14 [illegible]     .00    .00    .00   1.00    .00
15 M15 Nrg             .00    .00    .00   1.00    .00
16 M16 Frg             .00    .00    .00   1.00    .00
17 M17 Wn              .00    .00    .00   1.00    .00
18 M18 Nf              .00    .00    .00   1.00    .00
19 M19 Fw              .00    .00    .00   1.00    .00
20 N20 Ded             .00    .00    .00    .00   1.00
21 N21 [illegible]     .00    .00    .00    .00   1.00
22 N22 Rsn             .00    .00    .00    .00   1.00
23 N23 Ser             .00    .00    .00    .00   1.00
24 N24 [illegible]     .00    .00    .00    .00   1.00

Individual differences in each test are expressed in standard score 
form with fixed mean and standard deviation. The raw composite score of 
a child on V is a simple sum of his 24 standard scores, each multiplied by 
the weights given in the column headed Zv in the matrix. The 19 tests with
zero weights produce no variation at all in the V composite. The final raw 
composite score on V is therefore the sum of standard scores over the 
five verbal tests. As explained in Chap. 2, each child's weighted sum score 
is restandardized and then rescaled so that the final composite score Zv 
has a mean of 50 and a standard deviation of 10. 

Table 8.7 shows that, in similar fashion, the columns of weights are 
so chosen as to provide scores on each of the other four groups of variables
by the simple expedient of multiplying the standard scores of the definers 
of each cluster by 1.00 and the nondefiners by .00. 
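
The simple sum scoring just described amounts to a matrix product followed by rescaling. The sketch below assumes randomly generated raw scores in place of the Holzinger data and invents the function name; it illustrates the arithmetic only and is not the FACS program. The grouping of the 24 tests into definers follows the pattern of Table 8.7.

```python
import numpy as np

def composite_scores(raw, weights):
    """Weighted sums of standard scores, rescaled to mean 50 and SD 10."""
    z = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # standard score on each test
    composite = z @ weights                          # one weighted sum per dimension
    composite = (composite - composite.mean(axis=0)) / composite.std(axis=0)
    return 50.0 + 10.0 * composite

# Zero-based test indices of the definers of each dimension, columns ordered V, S, F, M, N
definers = {
    "V": [4, 5, 6, 7, 8],
    "S": [9, 10, 11, 12],
    "F": [0, 1, 2, 3],
    "M": [13, 14, 15, 16, 17, 18],
    "N": [19, 20, 21, 22, 23],
}
weights = np.zeros((24, len(definers)))
for column, tests in enumerate(definers.values()):
    weights[tests, column] = 1.0  # definers get 1.00, all other tests stay at .00

rng = np.random.default_rng(0)
raw = rng.normal(size=(301, 24))  # hypothetical raw scores of 301 children on 24 tests

scores = composite_scores(raw, weights)
print(scores.shape, scores.mean(axis=0).round(2), scores.std(axis=0).round(2))
```

Replacing the ones in `weights` with graded values changes the nature of the dimensions without changing the scoring arithmetic.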

The weights do not have to be all unities or zeros. They can be any 
values. Indeed, it is how these weights are chosen that determines the 
nature of the dimensions that are to be utilized in the typological analysis. 
The values of 1.00 could instead be replaced by the “best weights” of a
regression analysis, or they could be values preset by rational considerations.
Instead of ones and zeros they could be values between 1.00 and .00,
such as those determined by a factoring program, in which case the dimensions
would then be called factors. A systematic consideration of these
cases is instructive.


Simple sum cluster or rational composite scores 


The most meaningful weight matrix is the simple sum type illustrated in
Table 8.7, where the standard scores on a subset of variables form a composite
score on the dimension; each variable participating in the composite
does so with a weight of 1.00; the nondefining remaining variables contribute
a weight of .00. On commonsense grounds this form of weighting
makes dimensions easier to interpret than the case in which the variables
show graded weights.

The weight matrix of the simple sum type is usually derived from the
empirical key-cluster solution, which defines each dimension by the best
subset of most collinear variables. This matrix is usually specified by
the analyst in a preset solution following the empirical run. This type
of on-off weight matrix may be defined by the investigator who wishes on
theoretical or social grounds to form scores of individuals on rational
composites of the variables.


Regression-weighted cluster scores 


A modified form of the zero-one weight matrix is one in which each dimension
is defined by a subset of all the variables but the restricted number
of definers are not all given the equal weights of 1.00. The definers of a
dimension vary in the size of their factor coefficients, which suggests that
they be given differential effective weights according to the degree with
which they correlate with the dimension. To illustrate, the oblique factor
coefficients of the variables assigned to the verbal cluster, ordered by size
of coefficients, as computed by the CSA component of the BC TRY System,
are given in Table 8.8. Since the oblique
factor coefficients are the correlations of the five predictor variables with a
hypothetical score on the “criterion” verbal domain, we can adjoin the
domain to the matrix of correlations


TABLE 8.8 OBLIQUE FACTOR COEFFICIENTS AND MEAN
CORRELATIONS OF VERBAL CLUSTER VARIABLES AND THE
VERBAL CLUSTER DIMENSION

             Oblique Factor    Mean Correlation
             Coefficients      with Definers of V
V7 Snt           .84                .68
V9 Wmn           .84                .68
V6 Cmp           .82                .66
V5 Inf           .81                .65
V8 Wcl           .70                .57


definers do in fact have different effective weights in determining the full
variance of the composite score on the criterion. This effective weight of a
predictor is a sheer function of its average correlation with the other four
predictors. Thus, V8 appears in the simple sum weight matrix of Table 8.7
to have a weight equal to that of the other four predictors in determining
the variance of the scores on V. Actually, however, since its average
correlation with the other predictors of V is, as shown in Table 8.8, sensibly lower
than that of the other four, its effective weight in determining the variance
of the composite score on V is the least.
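
One way to make the notion of effective weight concrete is to compute each definer's share of the variance of the simple sum composite. The sketch assumes that the share of variable i is its covariance with the sum, 1 + (k - 1) times its mean correlation with the other definers, divided by the total variance; this is a common textbook decomposition rather than a formula quoted from the text. The mean correlations are those of Table 8.8, and the label V7 for the first row is inferred by elimination.

```python
# Mean correlation of each verbal definer with the other four (Table 8.8)
mean_r = {"V7": 0.68, "V9": 0.68, "V6": 0.66, "V5": 0.65, "V8": 0.57}

k = len(mean_r)
# Covariance of each standardized definer with the simple sum of all k definers
contribution = {v: 1.0 + (k - 1) * r for v, r in mean_r.items()}
total_variance = sum(contribution.values())  # variance of the simple sum composite

# Share of the composite variance attributable to each definer
share = {v: c / total_variance for v, c in contribution.items()}

for variable, value in share.items():
    print(variable, round(value, 3))
```

Under these assumptions V8 carries the smallest share even though its nominal weight of 1.00 equals that of the other four definers.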


Regression estimates of factors 


An investigator may wish to score individuals on factors, i.e., on dimensions 
derived by a factoring method that yields factor coefficients between 1.00 
and .00 for all n variables. There are two kinds of such factors: orthogonal 
and oblique. Orthogonal factors are usually varimax or quartimax rotations 
of principal-axes factors, or they are derived by the key-cluster component 
of BC TRY. Oblique factors are correlated dimensions such as those 
derived in the CSA program of the BC TRY System. The weighting principle 
for estimating scores on a factor is identical with that described above for 
regression weighting of a subset, except that all n variables are predictors
of a factor.
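
A minimal sketch of regression weighting, assuming the standard least-squares result that the weights are the solution of R b = r, where R holds the correlations among the predictors and r their correlations with the factor. The three-variable matrix and the factor coefficients below are hypothetical, and the BC TRY programs may proceed differently in detail.

```python
import numpy as np

# Hypothetical correlations among three predictor variables
R = np.array([
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
])
# Hypothetical correlations of the predictors with the factor (factor coefficients)
r_xf = np.array([0.8, 0.7, 0.6])

beta = np.linalg.solve(R, r_xf)    # regression weights for standard scores
multiple_r2 = float(r_xf @ beta)   # squared multiple correlation with the factor

def estimate_factor_scores(z, beta):
    """Estimate factor scores as weighted sums of standard scores."""
    return z @ beta

z = np.array([[1.0, 0.5, -0.2], [-0.3, 0.0, 0.8]])  # two hypothetical subjects
print(beta.round(3), round(multiple_r2, 3), estimate_factor_scores(z, beta).round(3))
```

For a cluster, the same computation is restricted to the definers; for a factor, every one of the n variables enters R and r, which is where the low-weight "noise" variables discussed next come from.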

There are serious drawbacks to scoring individuals on factors. In the 
first place there are variables with low weights that introduce unnecessary 
and therefore undesirable “noise” into the prediction. Second, on general
scientific grounds, the theory that all variables play a role in all factors (or
vice versa) is probably quite indefensible in most studies. Third, it is
usually inevitable that some variables overlap on different factors, i.e.,
have significantly positive or negative weights on two or more factors. As
such, they introduce redundant or “experimentally dependent” sources
of variation in the estimates of the different factors on which they overlap.
Fourth, if there are missing data generously scattered throughout the
original score matrix, the problem is much more serious in replacing
missing data when factor scores are to be estimated than when cluster scores
are desired.


Differential weighting of the
dimensions of typological space


In the examples of O-analysis given in the foregoing sections of this chapter
the scores on different dimensions computed by FACS were put in standard
scale form with a mean of 50 and standard deviation of 10. There are
distinct advantages in using standard scales because differences between
objects in profile level on the different dimensions are commensurate and
easy to grasp. Since the typological groupings discovered by O-type analysis
are sensitively affected by the scale values of each dimension, however,
there is no doubt that the standard scaling embodies the assumption that
each dimension is equally important in determining the typology.
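
A minimal sketch of such scaling follows. It is not the FACS routine; the function name is hypothetical, and the use of the population standard deviation (rather than the sample value) is an assumption. The `target_sd` parameter shows how a dimension could be given more or less weight than the standard 10.

```python
from statistics import mean, pstdev

def to_standard_scale(raw, target_mean=50.0, target_sd=10.0):
    """Rescale raw scores linearly to a chosen mean and standard deviation."""
    m, s = mean(raw), pstdev(raw)   # population SD is an assumption
    return [target_mean + target_sd * (x - m) / s for x in raw]

raw = [2, 4, 4, 4, 5, 5, 7, 9]                     # hypothetical raw cluster scores
equal = to_standard_scale(raw)                     # mean 50, SD 10
heavier = to_standard_scale(raw, target_sd=20.0)   # same dimension, doubled spread
```

Because euclidean distances between profiles grow with the spread of a dimension, a dimension scaled to a larger standard deviation counts for more in the typology.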

Some analysts may not accept the assumption of dimensional equality 
and prefer a differential weighting. A decision on this matter is not made 
in the O-type analysis itself but when the scores of the objects on the differ- 
ent dimensions are computed. Options are provided in the program FACS 
of BC TRY that permit three types of weighting. The first is equality, which 
is the standard option. 

The second option weights each dimension in accordance with the
total communality of all variables exhausted by the dimension during the
factoring process. Order of the dimensions is important in key-cluster
factoring, which has an elasticity to it in this respect not available to
principal-axes factoring. In key-cluster factoring, the analyst can determine the
meaning of a dimension any way he pleases by selecting its definers and
by subjecting them to the CC factoring procedures in any position he wishes
relative to that of other dimensions. Thus, if one defines the first dimension
by the five verbal tests, it will take out a different portion of the total
communality than if the verbal tests are preset as definers of the second, third,
or fourth dimensions. The analyst is therefore able to decide the weight
to be given to the verbal dimension in typological analysis under the second
option by specifying which dimension the verbal tests are to define.

A third option permits the analyst to set the exact weights of the
dimensions to any particular values he wishes. In BC TRY he does so by
presetting the scaling factors of the dimensions, i.e., by specifying on a
detail card in FACS the values of the standard deviations of the scores that
FACS computes on each of the dimensions. When cluster scores are to be
computed by FACS, one basis on which to specify the weights of the suc-


Removing subgroup differences in typological 
analysis by subgroup standardization in FACS 
In some problems, particularly of comparative analysis (Chap. 9), one may
wish to control the average score levels of different subgroups before
O-typing proceeds. The adjustments are made when one computes the
cluster or factor scores of members of the different subgroups. For exam- 
ple, in the comparative analysis of time and place groups in social-area 
analysis (Chap. 9) it was decided to compute the typology of all neighbor- 
hoods observed in 1940 and in 1950 taken as one large inclusive group. 
The inclusive group thus consisted of two subgroups, one observed at the 
beginning of the decade, the other at the end. 

We wanted to eliminate differences in average score level of sub- 
groups of neighborhoods separated in time, i.e., in 1940 vs. 1950. The reason 
was that if some of the variables were in some ways affected by such gen- 
eral trends as inflation, then neighborhoods with the same profiles on such 
variables within a given year, say 1940, would have different levels of pro- 
files on those variables across periods, say 1940 compared with 1950. Such 
gross effects are eliminated by standardization of scores within subgroups, 
i.e., by computing the scores on the subgroup of 1940 neighborhoods 
separately from computing them on the 1950 subgroup of neighborhoods, 
and then merging the two sets of scores as an input deck to one inclusive 
O-analysis by OTYPE. 
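
The effect of standardizing within subgroups can be sketched as follows. This is an illustration only, not the FACS computation; the data, the function names, and the choice of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def standardize(values, m=50.0, sd=10.0):
    """Put one set of scores on a scale with mean m and standard deviation sd."""
    mu, s = mean(values), pstdev(values)
    return [m + sd * (v - mu) / s for v in values]

def standardize_within(scores, groups):
    """Standardize each subgroup separately, then return the merged scores."""
    merged = [0.0] * len(scores)
    for g in set(groups):
        idx = [i for i, label in enumerate(groups) if label == g]
        for i, z in zip(idx, standardize([scores[i] for i in idx])):
            merged[i] = z
    return merged

# Hypothetical variable that doubles between 1940 and 1950 (e.g., through inflation).
scores = [10, 12, 14, 20, 24, 28]
years = [1940, 1940, 1940, 1950, 1950, 1950]
within = standardize_within(scores, years)   # the gross trend is removed
pooled = standardize(scores)                 # the trend is preserved
```

Under separate standardization the lowest 1940 neighborhood and the lowest 1950 neighborhood receive the same score; under pooled standardization the 1950 neighborhoods stay uniformly higher.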

On the other hand, we wanted to preserve differences in score levels 
of subgroups of neighborhoods located in different places, i.e., in San 
Francisco vs. East Bay. In this case the data of San Francisco and East Bay 
neighborhoods were merged before the scores were computed and thus 
were not standardized separately. 

Before an analyst undertakes a typological analysis, he may wish to 
give some thought to the subgroup structure of the inclusive groups he is 
dealing with in order to employ the correct design of cluster score com- 
putation that will preserve or eliminate group differences according to 
his wishes. Decisions on this matter can be troublesome. In the Holzinger
problem, for example, where the inclusive group consisted of children
from the factory group vs. children from the suburban group, the question
arose whether to standardize the V, S, F, M cluster scores in the two
groups separated before O-typing. The factory subgroup came extensively
from homes where the parental native language was not English; hence
their profile level might be depressed as the result of bilingualism. If
cluster scores were computed separately in the two subgroups, differences
between them due to language would be eliminated. But this within-group
standardization would also eliminate all selective forces beside language
differences that would produce mean differences in scores of the two
groups on V, S, F, and M. Such a sweeping equalization seemed
undesirable, especially since some of the children from the suburban group
also came from bilingual homes and would be thus penalized by separate
standardization. Since the matter was obviously complex and not to be
solved rationally by assuming differences before they were found or,
if found, presuming to know the reason for them, separate standardization
was not used. The effects of multivariate selection stemming from the
subgroup differences can be assessed in the typology itself.


O-analysis when the number of dimensions is large 


crude, one would need nearly 60,000 subjects.

TABLE 8.9 NUMBER OF PATTERNS FOR SOME
ILLUSTRATIVE NUMBERS OF DIMENSIONS AND
NUMBERS OF CUTS

Number of     Dichotomous Cuts   Trichotomous Cuts
Dimensions    (High, Low)        (High, Middle, Low)

 1                  2                    3
 2                  4                    9
 3                  8                   27
 4                 16                   81
 5                 32                  243
 6                 64                  729
 7                128                2,187
 8                256                6,561
 9                512               19,683
10              1,024               59,049

The importance of this relationship becomes acute in the use of
program OTYPE of the BC TRY System. This program typologizes the
subjects of a group on k cluster score dimensions. The principle employed
in a standard run is to start the solution with a set of arbitrary core O-types
lying in the most populated sectors of the cluster score space formed by
trichotomizing each dimension. All the cases are then assigned to the core
types with which they have their smallest euclidean distance, thus forming
a set of O-types to which all the cases are reassigned. The process is then
continuously iterated. Most of the O-types will converge on a stable
membership after as few as five iterations.

The critical first step in the solution is to start with trial core O-types
that are in highly populated arbitrary sectors of the score space (the
arbitrariness is removed as the result of iteration). Under standard options,
the selected core O-types are those which occupy sectors that contain at
least 2 percent (an optional value) of all the cases.
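
The general idea of this iteration can be sketched as below. It is an illustration under stated assumptions, not the OTYPE algorithm itself: the cut points of 45 and 55 used to trichotomize the standard scores, the function names, and the stopping rule are all hypothetical.

```python
from collections import Counter
from math import dist

def sector(profile, low=45.0, high=55.0):
    """Trichotomize each score: 0 = low, 1 = middle, 2 = high (cut points assumed)."""
    return tuple(0 if x < low else 2 if x > high else 1 for x in profile)

def centroid(profiles):
    return tuple(sum(col) / len(profiles) for col in zip(*profiles))

def iterate_otypes(profiles, min_share=0.02, max_iter=20):
    sectors = [sector(p) for p in profiles]
    counts = Counter(sectors)
    # Core types come from sectors holding at least min_share of all cases.
    core = [s for s, c in counts.items() if c >= min_share * len(profiles)]
    if not core:
        raise ValueError("no sector is populated enough to seed a core type")
    centers = [centroid([p for p, s in zip(profiles, sectors) if s == k])
               for k in core]
    labels = None
    for _ in range(max_iter):
        # Assign every case to the nearest center by euclidean distance.
        new_labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                      for p in profiles]
        if new_labels == labels:      # membership is stable
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:               # keep the old center if a type empties
                centers[j] = centroid(members)
    return labels, centers

profiles = [(60, 60), (62, 58), (61, 61), (40, 40), (38, 42), (39, 39)]
labels, centers = iterate_otypes(profiles)
```

With so few cases every populated sector passes the 2 percent rule; in a real problem the threshold screens out sparsely filled sectors before iteration begins.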

But if there are too many dimensions, it is possible that no sectors 
will contain more than one case, so that the selection of most-populated 
sectors is not possible and the iterated solution cannot get under way.
For example, in a recent problem with k = 14 dimensions and N = 500 
subjects, virtually all populated sectors had only one case in them and none 
had more than two, a not surprising result considering that with 14 tricho- 
tomized dimensions there were nearly 5 million sectors into which the 
500 cases had to be distributed (the computer took .70 min to do this monu- 
mental job!). A solution was finally achieved by rerunning OTYPE with each 
of the 14 dimensions dichotomously cut at the mean, a procedure that 
gave only 11 core types, some with as few as three cases in them, a slim
frequency on which to select core O-types.
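
The arithmetic behind this difficulty is easily checked. The sketch below simply raises the number of cuts to the power of the number of dimensions; the function name is hypothetical.

```python
def sector_count(k, cuts=3):
    """Number of sectors when each of k dimensions is divided into `cuts` parts."""
    return cuts ** k

trichotomized_14 = sector_count(14, cuts=3)   # 4,782,969, i.e., nearly 5 million
dichotomized_14 = sector_count(14, cuts=2)    # 16,384
cases_per_sector = 500 / trichotomized_14     # about .0001 case per sector
```

Even with dichotomous cuts, 14 dimensions give far more sectors than the 500 cases available, which is why so few sectors held enough cases to serve as core types.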


A procedure of determining core O-types
unrestricted by dimensionality


We obviously need some general procedure to follow that will always give a
suitable number of trial core O-types by which O-type analyses can initiate
the iteration process. Here is such a procedure:


1 If the number of dimensions is not greater than six, employ the standard
OTYPE method.

2 If there are more than six dimensions, perform this standard OTYPE analysis
only on a selection of the most salient and meaningful dimensions, the number of
them being that number in Table 8.9 which yields a total number of trichotomized
sectors less than the total number of subjects in the problem; for example, with 500
cases, the maximal number of dimensions chosen for OTYPE analysis is six, which
gives 729 trichotomized sectors into which the 500 cases are cast. In the case of the
14-dimensional problem mentioned above, the dimensions were thus cut down to
three sets. Each contained not more than six correlated dimensions that were
rationally related, and an OTYPE run was made on each set.

If, however, it is desirable to perform a typology on all the dimensions,
here is a procedure for finding suitable core O-types:


1 Run the standard condensation method on all the dimensions. With luck, 
there may be a sufficient number of core O-types for iteration. 


2 If procedure 1 does not work, change the number of partitions to dichotomies
and rerun OTYPE under this option. 


3 If procedure 2 does not work either, utilize the convergence method to 
locate core O-types, as described in the next paragraph. 


Convergence method of obtaining core O-types 


By this method, whatever the dimensionality may be, you can always find 
a set of trial core O-types which spans the score space and which lies at 
centers of gravity in the configuration of the objects. The procedure is a 
EUCO analysis applied to a representative sample of the subjects drawn 
from the full supply. A simple procedure is to take the cluster scores of 
every (N/120)th case, thus producing a deck of 120 subjects on which to 
perform the EUCO analysis, as described in the User's Manual. Under this 
procedure, however, preset the CC5 program so that it factors on k + 1
dimensions (if k is 14, preset CC5 to 15). The resulting definers of the k + 1
dimensions, i.e., the REFLX1 file, are the members of k + 1 core O-types.
Input these definers into OSTAT, which sets them up as an OTYPEI file. 
Then call OTYPE, which reads the OTYPEI file and starts the iterative 
process on the core O-type given there. 
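
The sampling step alone can be sketched as follows. The function name is hypothetical, and rounding the interval N/120 down to a whole number and truncating any surplus cases are assumptions; the EUCO, CC5, OSTAT, and OTYPE steps are not reproduced here.

```python
def systematic_sample(cases, target=120):
    """Take every (N/target)th case, giving a deck of at most `target` cases."""
    step = max(1, len(cases) // target)   # whole-number interval (assumed)
    return cases[::step][:target]         # drop any surplus beyond the target

cases = list(range(500))                  # stand-ins for 500 subjects' score cards
deck = systematic_sample(cases)
```

For 500 subjects the interval is 4, so the deck holds cases 0, 4, 8, and so on, up to 120 cases in all.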

If more than k + 1 O-types are wanted with which to initiate iteration,
a good set of additional core O-types would be dependent clusters displayed
in the SPAN diagrams of the EUCO analysis.


Chapter 9


COMPARATIVE CLUSTER ANALYSIS OF VARIABLES,
INDIVIDUALS, AND GROUPS


The first objective in comparative cluster analysis is to describe the
similarity of the dimensions discovered in different groups. This problem
is known as the “comparative dimensional analysis of variables” or
“factor matching.” In the domain of the intellectual abilities, for example,
one may discover in a middle-class suburban group of children that the
24 Holzinger tests of diverse specific abilities (Holzinger and Swineford,
1939) can be accounted for by four “basic” general abilities, or factors:
verbal (V), space (form, F), speed (S), and memory (M), as described in
previous chapters. Are these dimensions identical with those found in a
lower-class school of children of factory workers? Are the seven general
MMPI dimensions of introversion, body, suspicion, tension, depression,
resentment, and autism found in a group of psychiatric patients the same
dimensions discovered in a group of normals?

This problem has a direct, simple solution when approached by the
logic and procedures of cluster analysis based upon domain sampling
principles and incorporated procedurally in the BC TRY System of cluster
and factor analysis (Chap. 3). Dimensional analysis requires as basic data
the intercorrelations between the variables in the groups. These
correlations are not defined in the usual way in the original data when the
comparisons are across different groups, as in comparative analysis. Rather,
in comparative dimensional analysis what are needed are the factor
coefficients of the dimensions within each group (these are referred to in factor
analysis as the “rotated oblique factor coefficients”). These factorial data
within the different groups are compared across the groups by the
comparative cluster analysis programs called COMP1 and COMP2 of the
BC TRY System. The methods are described below.

The second general objective is that of comparing the typologies of
two or more groups of individuals. When, for example, the children
from the factory and the suburban groups are scored on the four general
abilities, V, S, F, and M, the children in each group can be sorted into
different types based upon the patterns of the scores. The person clusters
(or profile types) in the two groups can differ in two ways: (1) Even though
the same kinds of profile types may appear in the two groups, those
which occur with high frequency in one group may be rare in the other
group. This type of typological comparison across groups is based on the
similarity of their “frequency patterns” on a common typology. (2) The
kinds of types in the two groups may be different; those which compose
one group may not match the types of the other group. In the BC TRY
System, the programs expressly designed to perform the comparative
typology of groups are the components OTYPE, OSTAT, and EUCO.

The plan of this chapter is as follows. The comparison of the
dimensions of different groups (COMP) and of their typologies (OCOMP) is first
made for the case of the Holzinger study of the abilities of two groups, the
children from the factory and the suburban groups. Under exactly the
same format of analysis the COMP and OCOMP analyses of the patient
and the normal groups in a study of MMPI item-clusters are then
presented. Finally, COMP and OCOMP analyses on the social-area data are
presented. Our interest in these three studies is as much substantive as


The study of abilities: the Holzinger problem


Basic data structure


In the Holzinger problem, 301 grade school children were given 24 separate
tests of specific abilities. These tests are listed in Table 4.1, grouped into
the five domains of spatial, verbal, speed, memory, and mathematical
abilities. Most of these tests are similar to tests that are included today in
test batteries of “intelligence,” e.g., the WISC and WAIS batteries from
which verbal, performance, and full scale IQs are determined (Anastasi,
1961, chap. 12).

The total group of children, here called the “inclusive group,” were
children from two Chicago grade schools. Holzinger and Swineford (1939,
p. 6) describe them as follows: “The children in the Pasteur School came
largely from the homes of workers in nearby factories. Many of the parents 
were foreign-born . . . using their native language at home 
Both parents were American-born in 29 per cent of the cases, while 
in 48 per cent, both were foreign-born.” The second school was the Grant- 
White school in the suburb of Forest Park, Illinois. In this group ". . . 
both parents were American-born in 72 per cent of the cases while both 
were foreign-born in only 15 per cent. Almost 100 per cent of the children
were born in the suburb in which the school was located.”

The inclusive group can therefore be thought of as being composed
of two ecological groups. The 156 from the Pasteur School are called here
the “factory children,” the 145 from the Grant-White School, the “suburban
children.” The inclusive group has other subgroup structures, notably sex
groups and grade groups. Furthermore, the suburban children were
organized into two types of classrooms, homogeneous groups and random
classes.


Dimensional analysis of the 24 variables
in the inclusive group


A direct comparison of the dimensions of the 24 variables in the factory
and suburban children and of their separate typological structures can
best be made when the definers of their dimensions are the same. The
first objective, therefore, is to decide on the number of dimensions on
which the subgroup comparisons are to be made and on a common set of
definers of each dimension. A full-cycle key-cluster analysis of the 24
variables in the inclusive group, reported in previous chapters, discovered that
after four dimensions were extracted from the intercorrelations among
the 24 tests, their residuals were trivial. The defining variables of each
of the four dimensions are listed in Table 7.1.


Dimensional analysis of the 24 variables
in the factory children


To discover the cluster structure of the tests in the factory children, a
full-cycle key-cluster solution of this group's intercorrelations among the
24 tests was “preset” on the definers of the four basic dimensions found
in the inclusive group. The results are shown pictorially in Fig. 9.1, the 
bottom spherical plot, which is an annotated tracing of the printout of the 
diagram in program SPAN of the BC TRY System. The surface separation 
of any two tests on this sphere is a function of the correlation between 
them (technically, of their interdomain or common factor correlation). 
Two tests that correlate 1.00 have superimposed points, two that correlate 
.00 are 90° apart, represented in Fig. 9.1 by the distances between the
three boxes that form the spherical triangle; the boxes represent the 
subset of three independent dimensions derived by factoring on residuals. 
The five verbal tests cluster tightly together at lower left in the con- 
figuration, the four speed tests more loosely at lower right, the four form 
tests at the top. The six memory tests are marked by X, denoting that 
they all project into a fourth dimension which cannot be shown since it 
projects at right angles to the three depicted in Fig. 9.1. The five mathe- 
matical tests are depicted in these three dimensions; they are all depend- 
ent on V, S, F, and M in the sense of being predictable from the four. 


Dimensional analysis of the 24 variables 
in the suburban children 


Applying the same dimensional procedure to the correlation matrix of the 
suburban children gives the configuration shown in the top SPAN diagram 
of Fig. 9.1. At lower left in the configuration is the same verbal cluster as in 
the factory group, at lower right is the speed cluster, at the top is the 
space cluster, and the memory cluster also projects into a fourth dimen-
sion; the mathematical abilities once again deploy centrally as dependent 
variables predictable from the V, S, F, and M dimensions. Clearly, the 
cluster structure of the suburban children closely resembles that of the 
factory children. One obvious difference is that although the cluster
groups are about the same, they are, as groups, more separated from
each other in the factory children than in the suburban children, i.e., less
correlated with each other.


Comparison of the dimensions within 
each group separately (COMP) 


A metric description of the within-group structures is provided by a
program that computes the correlations between the ability clusters defined
as oblique dimensions. The oblique dimensions are computed by the CSA
(Cluster Structure Analysis) program of the BC TRY System. The values
of the interdimension correlations are given in Table 9.1, Sec. A. These
correlations are known in factor analysis as the “correlations between
rotated oblique factors” or the “common factor correlations.” In cluster
analysis they are called “interdomain correlations,” where each cluster
is conceptualized as a domain score C_i on many variables collinear with the
observed definers of the cluster (Tryon, 1959, eq. 24). Thus, the domain
score C_V on the verbal cluster is a hypothetical score on many variables
collinear with the observed set, V5, V6, V7, V8, and V9, shown in the SPAN
diagram.

FIGURE 9.1
Cluster structure of abilities within the suburban and factory group


TABLE 9.1 SIMILARITY OF THE FOUR BASIC HOLZINGER ABILITIES, V, S, F, M,
WITHIN AND BETWEEN THE SUBURBAN AND FACTORY GROUPS

A. Similarity of Cluster Dimensions within Each Group

                                                           Form
                            Verbal, V      Speed, S      (Space), F     Memory, M
Ability          Group      rcc   cos θ    rcc   cos θ   rcc   cos θ    rcc   cos θ

Verbal, V        Suburban   1.00  1.00     .43   .43     .58   .58      .46   .47
                 Factory    1.00  1.00     .42   .43     .35   .37      .14   .14
Speed, S         Suburban    .43   .43    1.00  1.00     .53   .51      .56   .54
                 Factory     .42   .43    1.00  1.00     .29   .28      .39   .36
Form (Space), F  Suburban    .58   .58     .53   .51    1.00  1.00      .60   .56
                 Factory     .35   .37     .29   .28    1.00  1.00      .27   .26
Memory, M        Suburban    .46   .47     .56   .54     .60   .56     1.00  1.00
                 Factory     .14   .14     .39   .36     .27   .26     1.00  1.00

B. Similarity of Cluster Dimensions between Groups (cos θ Only)
   [group subscripts on row and column labels illegible]

          V      S      F      M
V        .96    .39    .48    .32
S        .46    .89    .42    .48
F        .46    .36    .92    .39
M        .28    .42    .41    .83

C. Generality of Each Dimension (Reproducibility of Correlations)

              V      S      F      M
Suburban     .51    .37    .47    .40
Factory      .50    .27    .28    .18

D. Reliability Coefficient (α) of Cluster Score on Each Dimension

              V      S            F            M
Suburban     .90    [illegible]  [illegible]  [illegible]
Factory      .90    .74          .69          .73

The interdomain correlations, listed in Table 9.1 under the columns
headed rcc, are computed from the raw correlation matrix using the
well-known formula for the correlation of sums. The rcc values are precise
metric expressions of the degree of similarity of the four basic ability 
dimensions, V, S, F, and M, in the suburban and factory children. For exam- 
ple, the interdomain correlations between the verbal and speed dimensions 
in the two groups are .43 and .42, respectively; i.e., the two dimensions
have almost exactly the same degree of similarity in the two groups. 
Between the other dimensions the correlations are generally higher for 
the suburban children than for the factory children, a fact already seen 
visually in the SPAN diagrams of Fig. 9.1. The rcc values are a metric state- 
ment of similarity that is displayed visually in the diagrams. 
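
The formula for the correlation of sums may be sketched as follows. The sketch assumes standardized variables and unit weights, with unities in the diagonal of the correlation matrix; whether BC TRY substitutes communality estimates in the diagonal when computing rcc is not stated in this section, so the result should be read as the correlation between two observed composites. The matrix and the function name are hypothetical.

```python
from math import sqrt

def correlation_of_sums(R, a, b):
    """Correlation between the unit-weighted sums of variable sets a and b."""
    cross = sum(R[i][j] for i in a for j in b)   # covariance of the two sums
    var_a = sum(R[i][j] for i in a for j in a)   # variance of sum a
    var_b = sum(R[i][j] for i in b for j in b)   # variance of sum b
    return cross / sqrt(var_a * var_b)

# Hypothetical matrix: two definers per cluster, .5 within clusters, .3 between.
R = [
    [1.0, 0.5, 0.3, 0.3],
    [0.5, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.5],
    [0.3, 0.3, 0.5, 1.0],
]
r_sums = correlation_of_sums(R, a=[0, 1], b=[2, 3])
```

Here the two composites correlate .40, higher than the .30 between any single pair of tests drawn from different clusters.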

In the lower sections of Table 9.1 other metric properties of the four
basic ability dimensions are displayed. The “generality” of each, given in
Sec. C, is the degree to which each dimension accounts for all the original
intercorrelations among the 24 abilities. In both groups the verbal
dimension is the most general, but in the factory group the other three
dimensions are more specific than in the suburban. Of special interest to the
typological analysis is the reliability coefficient of the raw scores on the
four dimensions. The α reliability (Sec. D) of V is .90, but of the other three,
only of the order .70 or .80.


Direct comparative analysis of the
dimensions across groups (COMP)


We have assessed the similarity of the V, S, F, and M dimensions of the 
factory and suburban children by the subjective process of cross-referenc- 
ing their separate configurations in Fig. 9.1 and by comparing their within- 
group rcc values in Table 9.1, procedures that are indirect and inferential. 
We now compare their dimensions objectively and directly. 

Figure 9.2 displays the direct comparison achieved by the program
COMP2 of the BC TRY System. In this SPAN diagram, traced from the
printout, the verbal dimension of the suburban children, labeled V_G (for
the Grant-White School), and of the factory children, labeled V_P (for the
Pasteur School), are tightly clustered at lower left, meaning that they are
quite similar. At the lower right are the two points representing the speed
dimensions of the two schools. At the top are the two space dimensions,
and extending into the fourth dimension are the two memory dimensions.
This cluster structure directly compares, in one diagram, the similarity of
the two-dimensional structures that we only indirectly observed above
by cross-referencing.

FIGURE 9.2
Cluster structure of abilities across the suburban and factory groups
(comparative analysis, COMP2).

The direct index of the similarity of any two dimensions across
different groups is the “index of similarity” of the two dimensions (or “factors”),
called the cos θ between them. For two dimensions within a group cos θ is
[...] but it is estimated not from [...] from the oblique factor coefficients
[...] values are given in Table 9.1, [...] is shown pictorially in the [...]
memory clusters .83.
Briefly, the reasoning by which we designate two dimensions as
identical is based on the universal logic by which we conceive any two
entities as being the same, namely, that they show the same pattern of
“observations” in relation to a common set of other “referent entities.”

For example, the verbal dimensions in the two groups are virtually identical
because their patterns of factor coefficients on the constant set of 24
referent abilities are virtually identical. The index of pattern similarity of
any two entities on a common set of referent entities is P, called the
“index of proportionality” or “index of collinearity,” described in detail in
Chap. 12 for the case of pattern similarity of the factor coefficients of any
two dimensions. The value of the index of similarity cos θ of any two
dimensions in different groups is a simple quadratic function of P.

To sum up, we find in Fig. 9.2, and from the metric values in Table 9.1,
that the four basic dimensions V, S, F, and M in the two groups are highly
similar. But in the factory children they are somewhat more independent
of each other than in the suburban children. An environmental explanation
of this is that the parents of the suburban children stress scholastic
achievement, implementing their ambition by pushing their promising
children in all abilities, letting their less promising children fend for
themselves. Consistent with this theory, we find that it is precisely in the
suburban children that the scholastic institution of “homogeneous”
classification is employed, namely, the sorting of sheep and goats into different
classrooms. In the factory group, children generally are left to fend for
themselves.

There is an alternative genetic explanation. Probably there is more
stringent assortative mating on abilities among suburban parents. This
sort of sexual selection would generate a higher correlation among all
abilities in the suburban group than in the factory, where assortative
mating would be more random. A systematic treatment of such environmental
vs. genetic correlation-producing agencies in the case of abilities is
presented elsewhere (Tryon, 1935, 1939).

Comparative typological analysis in
the Holzinger problem (OCOMP)


When we allocate children having the same patterns of scores on the basic
abilities, V, S, F, and M, to O-types, do we find the same typological
structure of these O-types in the factory and suburban groups? The next two
sections are discussions of procedures in the BC TRY System used to
answer this question.

Similarity of frequency patterns of the two groups
on the common typology of the inclusive group


The first of two ways of determining the typological similarity of two groups
is to discover the degree to which they show the same frequency of cases
falling in the common typology of the inclusive group. This common set
of O-types is shown in Table 9.2 under the general heading
Typology. The first type, labeled H1, consists of 14 children whose pattern
of respective mean cluster scores on basic abilities, V, S, F, and M, is
[illegible], 36, 44, 37. These are mean standard scores on a scale whose
mean for 301 children in the inclusive group is 50 and standard deviation 10.
Underlined scores of 40 (1 standard deviation below the mean) or below are
termed “low” in the column headed Descriptive Name; those 60 or above
are called “high.” Type H1 therefore is described in the table as “low speed
and memory.”

Some types have a high frequency, like H9, the average type, with
38 children in it, others with low frequency, like H2, the low verbal and
memory type, with only 8 cases in it. The logic of typological similarity of
the factory and suburban children is simple. If both groups show the same
frequency pattern on these common 16 inclusive classes, then they have
the same typological structure, but to the degree that their frequencies
in these 16 classes differ from each other, their typologies differ.

It is a simple matter to count how many children in each group fall
into the 16 classes, from which the percentage falling into each class is
computed. These percentages are given in Table 9.2 under the heading
Factory vs. Suburban. The two columns labeled pF and [...] groups, on
the basis of which their overall index of similarity at the [...] index of
proportionality P discussed [...] exactly the same frequency patterns, the
index is 1.00. If their patterns are utterly different, i.e., if the occurrence of
each type in one group is matched with the absence in the other group,
then the index P is .00. The value of P for the two ecological groups of
children is .78, denoting a considerable amount of typological similarity
of the two groups.
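
The two limiting cases just described can be reproduced with a short sketch. It assumes that P takes the familiar normalized cross-product form; the exact definition is the one given in Chap. 12, which is not reproduced in this section, so the formula, the function name, and the percentages below are illustrative assumptions only.

```python
from math import sqrt

def proportionality(x, y):
    """Normalized cross-product of two patterns on a common set of classes."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / sqrt(sum(a * a for a in x) * sum(b * b for b in y))

same = proportionality([10, 30, 60, 0], [10, 30, 60, 0])      # identical patterns
disjoint = proportionality([40, 60, 0, 0], [0, 0, 70, 30])    # no type shared
partial = proportionality([40, 40, 20, 0], [20, 40, 30, 10])  # some overlap
```

Identical frequency patterns give 1.00, patterns in which every type present in one group is absent in the other give .00, and partially overlapping patterns fall in between.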
For N1 = 156, N2 = 145, p = .06 the equation gives σd = .027. Setting the
limit for d as 2 standard deviations gives a critical value (significant
[...]) of approximately 5 percent.

Significant differences are marked by (s) for suburban or (f) for
factory, depending on which group has the highest percent. For example,
the largest difference between percents is 12 in type H8, low verbal. In
this difference the greatest percent frequency is 13 in the factory group.
Next is H10, high verbal, most characteristic of the suburban group. These
two verbal types therefore represent the greatest typological differences
between the two groups. The other significant differences indicate that
the suburban group falls more heavily into low memory (H3) and low
[illegible] (H7), whereas the factory children occur more frequently in the
high [illegible] (H11) type. Verbal, memory, and speed dimensions most
markedly differentiate the typological differences between factory and
suburban children.
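
The standard error quoted for the difference between two percentages can be checked numerically. The equation itself does not survive in this section, so the sketch assumes the usual pooled formula for the standard error of a difference between two proportions; that assumption does reproduce the quoted value of .027. The function name is hypothetical.

```python
from math import sqrt

def se_difference(n1, n2, p):
    """Standard error of the difference between two proportions (pooled p assumed)."""
    return sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

sigma_d = se_difference(156, 145, 0.06)   # factory N, suburban N, expected proportion
critical = 2 * sigma_d                    # limit of 2 standard deviations
```

The result is about .027 for the standard error and about .055 for the critical difference, i.e., roughly 5 percentage points.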

Since sex differences in abilities are of universal interest, we present
the data for determining the typological similarity of the boy vs. girl
subgroups, in the far right columns of Table 9.2. From the percent columns
in the 16 classes, the index of similarity for the sex groups is P = .85,
somewhat higher than for the factory and suburban groups. The significant
differences show that boys more frequently fall into low speed and
memory, and low memory, the girls into high memory.


Similarity of empirically derived
typologies of the groups


In similar fashion the separately worked-out typology of the suburban
children is also given in Table 9.3, indicating the 13 classes of these
children, listed as S1 through S12 and Unique.
A general impression of the typological similarity of the two groups
can be obtained by comparing the descriptive names of the two and by 
noting from these names which types are present in both groups and 
which are present in one but absent in the other. 

A more precise comparison of the different typologies is achieved by 
including all 26 types of both groups (14 F types plus 12 S types) into the 
same analysis, from which we get exact values of the similarities and 
differences between them. The procedures for this analysis are called 
"EUCO analysis” in the BC TRY System. The logic of the analysis is simple. 
Each type is considered to be an abstract “individual” plotted as a point 
in the cluster score space of V, S, F, and M, where its locus is determined 
by its four standard scores listed in Table 9.3. Program EUCO of the BC 
TRY System computes the euclidean distance between each pair of types 
and prints these values in a pair-comparison matrix from which one can 
read off precisely the degree of similarity between any two types. 
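The pair-comparison matrix described above can be sketched in a few lines. This is only an illustration of the distance computation, not the EUCO program itself; the function name and the score values are hypothetical and are not taken from Table 9.3.

```python
from math import dist


def euco_matrix(scores):
    """Euclidean distance between every pair of types.

    `scores` maps a type name to its tuple of standard scores on V, S, F, M.
    """
    names = list(scores)
    return {a: {b: dist(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical standard scores on V, S, F, M for three types.
types = {
    "F1": (40.0, 45.0, 50.0, 42.0),
    "S1": (43.0, 49.0, 50.0, 42.0),
    "S7": (65.0, 55.0, 52.0, 50.0),
}
distances = euco_matrix(types)
print(distances["F1"]["S1"])  # 5.0
```

Small entries mark types that lie close together in the cluster score space; the matrix is symmetric with zeros on the diagonal.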

Space limitations do not permit printing this euclidean distance
matrix here. In its stead, however, we present a pictorial representation
of the distances between the types in the form of the SPAN diagram given
in Fig. 9.3. To secure this diagram, the EUCO matrix is first transformed
to a correlation matrix by correlating columns of EUCO values; this
correlation matrix is then run through a standard key-cluster analysis,
ending in the SPAN diagram of Fig. 9.3.

The configuration on the SPAN diagram describes the similarities and
differences between the factory and suburban O-types. The circles
represent the 14 factory O-types, the squares the 12 suburban O-types. Also
included in this analysis are the 15 inclusive H types from Table 9.2. The
sizes of the circles and squares and the length of the underline of the H
types are proportional to the frequency of each type. The four dimensions,
V, S, F, and M, are also plotted, these being secured by inputting abstract
model "individuals" whose four standard score values are especially
selected to enable one to plot the dimension lines as score axes.

The large supercluster at left center consists of types all in the low
region, meaning that generally they have standard scores below the mean
on all four dimensions. However, this supercluster breaks off into two
general subclusters. The upper subcluster consists largely of suburban
types S1, S2, S3, fairly well represented by the inclusive types H1, H4,
H3, and H6, whereas the lower subcluster consists largely of F, or factory
types, which, with S4, are well represented by inclusive types H2, H5, and
H7. From these facts the similarities and differences between the types in
this general region of low scoring can be ascertained. There are real
differences in the typologies of the two groups in this region.
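The first step toward the SPAN diagram, correlating the columns of the distance matrix, can be sketched as below. The sketch assumes an ordinary Pearson correlation between columns; the key-cluster analysis and the SPAN plotting that follow are not reproduced. Function names and the sample points are hypothetical.

```python
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def column_correlations(matrix):
    """Correlate every pair of columns of a square distance matrix."""
    n = len(matrix)
    columns = [[matrix[i][j] for i in range(n)] for j in range(n)]
    return [[pearson(columns[a], columns[b]) for b in range(n)] for a in range(n)]


# Hypothetical points: the first two coincide, the last lies far from both.
points = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distance_matrix = [[dist(p, q) for q in points] for p in points]
r = column_correlations(distance_matrix)
```

Two types with the same profile of distances to all other types correlate near 1.0 (here `r[0][1]`), while types at opposite ends of the configuration correlate negatively (here `r[0][3]`), which is the information the subsequent cluster analysis works from.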
FIGURE 9.3
Spherical representation of the typologies of the factory and suburban
groups.


Generally, a study of the configuration reveals findings similar to
those found from the similarity of the frequency patterns, namely, that
verbal, memory, and speed dimensions most markedly differentiate the
factory and suburban groups. For example, at the top of the diagram,
high verbal is represented only by a suburban type, S7. Low verbal
through the lower half of the diagram is heavily dominated by factory
types.

How representatively do the 15 inclusive O-types sample the 26
different types in both ecological groups of children? This question is
important because in the practical usage of the typology of abilities, these
would be the types usually used for the classifications of individuals. The
answer is provided by noting whether one or more of the 15 H types lie
in all regions occupied by the 26 factory and suburban types. By inspecting
the SPAN diagram and by comparing the F and S types of Table 9.3 with
the H types of Table 9.2 it becomes clear that the 15 H types fairly cover
the ground.

The study of personalities: the MMPI 


Basic data structures 


The second study selected for comparative dimensional and typological 
analysis is that of the responses to the items of the MMPI by groups of 
normal subjects and psychiatric patients. 

The variables are 118 items of the MMPI drawn from the full item
supply of 566 to which the subjects responded. The 118 were those
tabulated in the previous chapter. The subjects were the inclusive group
consisting of the normal and the patient groups. The normal subjects were
90 Armed Services officers matched for age and education against 220
patients. The latter were outpatients of a Veterans Administration mental
health clinic, consisting of 70 diagnosed schizophrenics, all with a
history of hospitalization within the previous 6 years, and 150 diagnosed
anxiety patients, none with a history of any hospitalization for
psychiatric disorder.


Dimensional analysis of the 118 item-variables 


in the inclusive group 


‚ and it also includes the short- 
“dependent” item-clusters, D (depression), 
sm), whose item numbers are also given 
ems is given in Table 11.2. 


those given below on the pa 
dimensions are required to 
TABLE 9.4 DEFINING ITEMS OF THE SEVEN
ITEM-CLUSTERS OF THE MMPI


A. The Four Basic Item-clusters


I: Introversion (Full form, 26 items, reliability .93; short
form, firstᵃ 17 items, reliability .91)
52 1 


377 267 317 —415 
et 57 172 — 309 —264 —482 
321 86 —479 138 
201 171 509 —353 
180 —547 292 304 
=371 —521 => 79. —449 


B: Body symptoms (Full form, 33 items, reliability .92;
short form, first 17 items, reliability .89)


—243 —230 125 72 —160 — 18 
189 114 — 68 = 3 191 —192 
108 47 10 — 36 —153 14 

—190 44 23 —163 263 
62 — 55 161 — 51 —330 

—175 29 544 —103 =. 2 


S: Suspicion and mistrust (Full form, 25 items, relia- 
bility .85; short form, first 17 items, reliability .83) 
447 


404 244 284 455 
507 348 319 438 
383 368 71 89 
390 280 558 112 
436 265 406 426 
136 469 278 316 


T: Tension, worry and fears (Full form, 36 items, reliability .92;
short form, first 17 items, reliability .88)


555 543 448 182 158 

431 442 186 32 303 351 
337 43 499 439 13 -ІЗІ 
817 — 2 166 335 388 365 
238 340 338 102 322 494 
506 — —152 -407 473 360 492 


B. The Three Remaining “Dependent” Item-clusters


D: Depression and apathy (Full form, 28 items, reliability .94;
short form, first 17 items, reliability .91)
6 41 414 526 33 
—107 259 396 361  — 88 
236 418 61 384 — 46 
an  —'g 41 84 104 
“ae 549 142 T 
67 397 ' 
R: Resentment and aggression (Full form, 21 items,
reliability .87; short form, first 16 items, reliability .82)
94 381 145 106 
336 97 148 147 
468 536 28 443 
—399 139 162 
^ 0 m 
129 i 
A: Autism and disruptive thoughts (Full form, 23 items,
reliability .86; short form, first [...] items, reliability .81)


559 545 342 89 
241 358 374 356 

15 560 459 40 
349 —329 297 31 
425 100 33 134 
511 345 359 


ᵃ Reading by columns from the left.
and T clusters. Since the last of these, T (tension), had the greatest
generality of the remaining four, it was decided to add T to I, B, and S
as the final set of basic four dimensions of the MMPI.


Dimensional analysis of the 118 item-variables 


in the patient group 


A full-cycle key-cluster solution of the intercorrelations between the 118
items in the patient group resulted in the cluster structure depicted in
Fig. 9.4 (top diagram). This factoring process was “preset” on the four
basic dimensions defined by the items of I, B, S, and T. In the tight
cluster at lower left in the configuration the symbols plotted as I and
enclosed in a broken line are 15 of the 17 introversion items that define
this cluster. The remaining two lie nearby in the direction of the two
arrows. In another tight cluster at lower right are 16 body, or B, items;
the seventeenth item was dropped from the analysis because of trivial
communality (h² < .10). The suspicion cluster is at the top. The remaining
four clusters, depression, resentment, autism, and tension, lie within the
framework of the three I, B, S clusters. Clearly the total configuration
for the patient group shows an excellent cluster structure; it is
virtually the same as that found previously in the total inclusive group
(Tryon, 1966, fig. 1).


Dimensional analysis of the 118 item-variables
in the normal group


A radically different dimensional structure emerges in the normal group,
shown in the lower portion of the SPAN diagram of Fig. 9.4. The dramatic
change is in the body cluster, which was so sharply evident in the patient
group. It is absent as a distinct cluster among normal subjects, and so are
the depression and autism clusters. But the introversion and suspicion
clusters do appear as fairly independent item groups. Tension and
resentment clusters also remain but move into a grand arc bounded by the
introversion and suspicion clusters. It appears that only introversion and
suspicion are the dominant and distinctive dimensions of normal subjects
in the MMPI item-clusters.


Comparison of the dimensions
within each group separately


Precise numerical statements about the seven item-clusters in each of the
two groups are given in Table 9.5, Sec. A (analogous to Table 9.1 in the
Holzinger problem) and Secs. C and D. The relationships between the seven
domains represented as dimensions (or oblique factors) are given in
Sec. A by the interdomain r_cc values; those for the patients are above the
lined-off diagonal, those for normal subjects below. These “correlations
between oblique factors” are merely abstract metric descriptions of the
complex relationships depicted in the SPAN diagram, and though they are
more precise numerical statements compared with the verbal statements
about the configuration, they are more difficult to organize conceptually.
We must leave to the reader a detailed examination of this complex table of
relationships, suggesting that he cross-reference his study of it by
simultaneously referring to the visual configuration in Fig. 9.4.

FIGURE 9.4
Cluster structure of 118 MMPI items within the patient and normal groups.

Several obvious points need mention here. In both groups the introversion
and suspicion dimensions are the most independent, and tension is


TABLE 9.5. SIMILARITY OF THE SEVEN MMPI ITEM-CLUSTER DIMENSIONS
WITHIN AND BETWEEN THE NORMAL AND PATIENT GROUPS


A. Similarity of Item-cluster Dimensions within Each Groupᵃ


1 B 5 | D R | А T 
p UM | i i Жел e 
| cos | | cos | cos | | cos | | cos | cos 555 
| rec 6 | тес | 0 | те 8 | тес| 0 | rec | 0 Tec | 0 | rec ЈЕ 
Teena ЭЖ Әл ee EEEN m т ДЕ | С 
| | 12 | 13] 31 | -31 | .71 | 69 | 47 | 46 | 33 | 38 | .50 
| | 
B 52 | 46 | | за | -34 | 32 | 31 | 37 | 37 | 50 | 49 | .63 | -60 
5 .07 | .06 | .61 | .52 | | 38] 37 | .66 | 162 | 65 | „61 | 59 | -57 
D „16 | .60 | 56 | 43 | 43 | 38 | | 66) 65| 57] 55 | 78 | 76 
R 37 | 32| 64 | 52 | 76 | 69 | 76 | 65 | „ва | .62 | .79 ^" 
А 81 | 45 | 90) 69 | 77 | 70| 72 | 59 | 74 68 | “је 
i 59 | 53 | 72 | -53 |.50 | 48 | 2 | 59 | 71 | .63 | 67 | 61 
B. Similarity of Item-cluster Dimensions between Groups (cos θ only)
Ip Bp | Sp D, Rp Ap m 
Је 31. 20 22 52 || |- 245 38 
Bx 33 49 | ЛЕН БЕЛ ПЕК у ж] 
Sy ab. | 30 74 28 | 60 | 57 
Ds 6 | 3 8 | 8 | 6 58 
Ry 33 31 64 48 | 80 57 
Ax ІШЕ? 61 45 60 69 
Tw 48 Ai. СЕП у ТЕЙ | 257 
= ane = CURA Үз. чаја А HH 
C. Generality of Each Dimension (Reproducibility of Correlations)
| du y Do 
І | ХЕ | 5 D R A Le 
Normals 38 51 43 53 E 52 | Ғғ 4 49 
Patients 31 19 24 52 42 36 61 
D. Reliability Coefficient of Cluster Score on Each Dimension
Normals 81 54 83 72 80 | 76 15 
Patients 90 87 83 88 з | 7 2 
| eee 


ᵃ Values for patients lie above the ruled diagonal and those for normals below.
most positively correlated with all the other dimensions. The body
dimension is radically different in the two groups, fairly specific in the
patients but rather general in the normal subjects, correlating .90 with
autism. This generality of the body dimension is misleading in the normal
group, because we know from the SPAN configuration that the body dimension
is not a cluster-defined dimension in normal subjects but a mere sampling
of heterogeneous items from their whole sphere of items. It is a grab bag
of items in the normal group, just as autism is, so that their high
correlation is merely due to both being similar hodgepodges.


Direct comparative analysis of
the dimensions across groups


When we project the dimensions of the two groups into the same COMP2
analysis, the relations among the dimensions not only within but especially
across the two groups are clearly displayed. They are pictorially displayed
in the single SPAN diagram of Fig. 9.5 (analogous to Fig. 9.1 of the
Holzinger problem). The sharply differentiated and spread-out dimensions of
the patients, denoted by the subscript P attached to the seven dimensions,
I, B, S, D, R, A, T, confirm the within-group cluster structure of their
items as previously depicted in the upper sphere of Fig. 9.4. In contrast,
the within-group structure of the dimensions of the normal subjects,
indicated by the subscript N, confirms the narrow, essentially
two-dimensional band ranging from the introversion dimension to the
suspicion dimension.

Consider, now, the similarity of the dimensions across the two groups
as objectively measured by the cos θ values, given in Table 9.5, Sec. B,
especially those on the diagonal. The most similar dimensions across the
groups are introversion (.71), suspicion (.74), resentment (.80), and
tension (.73). The least similar is the body dimension (.49), a different
kind of dimension in the two groups.

The index of dimensional similarity cos θ and the interdomain (common
factor) correlations r_cc, given as paired values in Sec. A of Table 9.5,
show a close correspondence only for tight clusters I, S, R, and T.


Comparative typological analysis 
in the MMPI problem 


The comparative typological objective is to discover the degree to which
O-types of individuals, formed by classifying together individuals having
the same pattern of standard scores on the four basic MMPI dimensions,
I, B, S, T, have the same structure in the patient and normal groups. In
this analysis, each person was scored by his full-form scores on I, B, S,
and T.


Similarity of frequency patterns of the two groups on
the common typology of the inclusive group

In the typological analysis of the inclusive group by program OTYPE, 14
O-types emerged. These are listed as types M1 to M14 in Table 9.6, under
Inclusive Typology, with the frequencies, standard scores on I, B, S, T,
and descriptive names. When the normal and patient subjects are sorted
separately into these 14 inclusive O-types, the percentages falling into
them are the values listed in the columns labeled “% in Each Group.” As a
point of special interest, the patient group is separated into its two
component diagnostic groups, anxieties and schizophrenics.

The overall similarity of the typology of the three groups in relation
to each other is given by their P values: normal vs. anxiety groups show a
P of [...], indicating virtually no similarity in their typological
structures; there is a mild typological similarity of normal subjects and
schizophrenics, with P = .41; the anxiety and schizophrenic patients, in
contrast, bear considerable resemblance, having P = .73.

The details of the group differences, given in the columns headed
Differences, are of great interest. The normal subjects are almost
exclusively concentrated in types M1, M2, and M3, described generally as
extrovert, healthy, and relaxed, with a few in M8, the suspicious. The
anxiety patients excel in the somatic types, M7, M9, M10, M13, M14,
indicating persons most preoccupied by body disturbances. The schizophrenic
patients, compared to the anxiety patients, behave typologically somewhat
like normal subjects, except that they fall heavily in the introvert type,
M11. The standard errors of the differences are .034, .041, and .037 for
the normals-anxieties, normals-schizophrenics, and anxieties-schizophrenics
differences, respectively.


Classificatory scheme. 
FIGURE 9.6
Spherical representation of the typologies of the normals and patients.


The patient types (abbreviated P and placed in squares) are largely in the
cluster at the right or high region of the configuration. This separation
confirms, of course, the finding of the previous section, but the SPAN
configuration provides a more differentiated description.

The 14 inclusive types (abbreviated M and underlined) are located in all
regions of this typological space where there are normal and patient
types. This fact means that as a system of classifying individuals,
normal or mentally ill, the 14 inclusive types satisfactorily cover the
ground.


The study of social structure: the Social-area problem 


Basic data structure

In previous chapters we described V-analysis and O-analysis of the social
areas of the San Francisco Bay region. In this chapter we present a
comparative analysis of the dimensions and of the typology described in the
previous chapter. There are 225 census tracts in the analysis, each
“observed” in 1940 (prewar) and in 1950 (postwar), of which 105 are from
San Francisco and 120 from the East Bay communities. Each census tract
was treated as two units, one for the 1940 data and one for the 1950 data.
Thus, there are 450 objects in the analysis. Of the 33 census variables of
the study, 14 were selected in V-analysis to represent the three
dimensions F, family life; A, assimilation; and S, socioeconomic
independence.


Dimensional analyses 


The four subsets of data were submitted to preset V-analysis through the
BC TRY programs CC5, CSA, and SPAN. The metric descriptions of the
results of these analyses are given in Table 9.8, which also shows the
defining variables of each of the clusters, the oblique factor coefficients
for the definers in each analysis (on the definers' dimension only), the
cluster reliability (α) of each cluster in each of the four analyses, and
the generality of each cluster in the four analyses (reproducibility of
correlations). The actual structures of the different cluster analyses,
across time and place, are highly similar. The degree of similarity and the
specific differences involved are described below in the comparative
analyses. Since the analyses are generally similar to the analysis reported
in greater detail earlier, no additional data are reported here.


Comparison of the dimensions
within each group separately


The metric comparison of the cluster structures within each of the four
groups is given in Table 9.9, both in terms of the domain intercorrelations
and the cos θ measures. The four 3 by 3 tables for the two measures are
merged, cell by cell, in order to facilitate comparison. For example, the
correlation between the S and F cluster dimensions for San Francisco is
.08 in 1940 and .27 in 1950. Comparable correlations in the East Bay
communities are .14 and .15, as shown in the table. The data reported in
Table 9.9 indicate a high degree of cluster structure consistency in the
four sets of data. Clusters S and F define dimensions that are nearly
independent, while clusters S and A have a moderately high degree of
correlation and A and F are in the intermediate correlation range. The
pictorial comparison of these four analyses can be made by inspecting
the spherical diagrams of the 14 variables defining the three demographic
dimensions, shown in Fig. 9.7. The four upper diagrams in Fig. 9.7 are
marked according to the place and time for the respective data represented.
TABLE 9.8 DEFINING VARIABLES OF THE F, A, S DIMENSIONS OF METROPOLITAN
NEIGHBORHOODS AND THEIR RELIABILITY AND GENERALITY BY SITE BEFORE
AND AFTER WORLD WAR II


Cluster-defined Dimensions 


Socioeconomic Independence 
5 


Family Life F Assimilation A 


A. Defining Variables 


Mm  Managerial-profession | Oo 


Owner-occupied Sm Skilled males : 
males FI Large families Nw Native-born whites 
Df Domestics (females) |ға Family-detached F Females 
Om Own-account males Uf Housewives (unem- |Fe Foreign from 
Co  College-educated ployed females) Protestant Europe 
Am Young children Wf  White-collar females 
(males) 
B. Oblique Factor Coefficient of Definers of Each Dimension 
San San San 
De- Fran- De- Fran- De- Fran- 
finer cisco East Bay | finer 


Cisco | East Bay | finer 


cisco East Bay 
"40 | '50 | 40 | 50 '40 | '50 | '40 | 50 о | '50 | '40 | '50 


Mm 98 | .94 | .97 |1.00 00 


ез шеней 


92 | 94 | 92 | 85 |Sm 78 | 88 | .81 = 
Df 87 | 92 | 94] аи 96 1.97 | 97 | 92 |Nw | 82 | .78| 76 | 7 
Om | .90 | -86 | 69] 79 Fd | 81] 85 | в. 85 ЈЕ 69 | .68 | .58 | -52 
Co „81 | .84 | .87 | вз uf | 


83 | 87 | 85 | 83 | Fe 68 | .51 | .78 E 
Am | „81 | 83 | 73 | а/м | g1| 811911. 


C. Reliability of Cluster 


Scores on Each Dimension 655 
І 1 | (i E; E E ZU 324 4 Тере 

| 55 | 94 | 93 | за] | 95 | 96 | 96 | 91 вә | 88 | .89 | .90 
cc е ПЕН ЫМ 


D. Generality of Each 


viis NUN at idis Tu Қ Қы t o 
Generality of E Dimension (Reproducib ty of Correlations Between Definers) | 
| ғы ЕРТЕ eee E WI 
|а |а Таа “8 42 | за зз | оза | M | -52 
| кен ВН INEST] АД neu ips 


by the next two rows of figures in the table- 


The direct comparative analysis involves comparisons of all twelve
TABLE 9.9. SIMILARITY OF CLUSTER DIMENSIONS WITHIN EACH
PLACE-TIME GROUP


                       Socioeconomic, S   Family Life, F   Assimilation, A
Place-Time Group        r_cc    cos θ      r_cc    cos θ     r_cc    cos θ

Socioeconomic, S
  San Francisco '40     1.00    1.00       .08     .08       .30     .29
  San Francisco '50     1.00    1.00       .27     .10       .59     .43
  East Bay '40          1.00    1.00       .14     .15       .57     .56
  East Bay '50          1.00    1.00       .15     .27       .57     .57

Family life, F
  San Francisco '40      .08     .08      1.00    1.00       .27     .27
  San Francisco '50      .27     .10      1.00    1.00       .33     .23
  East Bay '40           .14     .15      1.00    1.00       .31     .30
  East Bay '50           .15     .27      1.00    1.00       .31     .32

Assimilation, A
  San Francisco '40      .30     .29       .27     .27      1.00    1.00
  San Francisco '50      .59     .43       .33     .23      1.00    1.00
  East Bay '40           .57     .56       .31     .30      1.00    1.00
  East Bay '50           .57     .57       .31     .32      1.00    1.00


TABLE 9.10 SIMILARITY OF THE CLUSTER-DEFINED DIMENSIONS
BETWEEN THE PLACE-TIME GROUPS (COS θ ONLY)


Socioeconomic, S | Family Life, F | Assimilation, A


Same place but spanning World War Il: 97 98 89 
San Francisco, '40 vs. "50 94 9 E 
East Bay, "40 vs. "50 е 

Same time but different place: 

! À 92 .90 
San Francisco '40 vs. East Bay 0 x 84 84 
San Francisco "50 vs. East Bay "50 3 91 | .89 


Mean | 


dimensions, three from each of four analyses. The matrix of cos θ values
was subjected to a single full-cycle cluster analysis, yielding the cluster
structure graphically displayed in the lower left portion of Fig. 9.7.

The configuration on the sphere (marked COMP2) dramatically
reveals the identity of socioeconomic, family life, and assimilation
dimensions in the four groups. All four S dimensions virtually occupy the
same locus on the sphere, as do the four F dimensions and the four A
dimensions, demonstrating that the basic tridimensionality of metropolitan
neighborhoods is undisturbed by differences in locale or time or of course
by difference in the people that occupy them (the people were largely
different individuals in the groups).
FIGURE 9.7
Comparative cluster structure of the 14 variables that define the three
demographic dimensions of neighborhoods.
Similarity of frequency pattern on a common typology


Using the typology methods of OTYPE, 11 types of social areas are
discovered in the full inclusive group of 450 neighborhoods. The basic
metric data on these 11 neighborhood O-types are presented in Table 9.11.
The first column lists the 11 O-types, S1 to S11. Under Common Inclusive
Typology are listed the mean standard scores of each O-type on the three
cluster dimensions, the overall homogeneity value for each O-type,
descriptive names with indications of the high and low cluster score means
in the O-type score profile, and the percent of the 450 neighborhoods
contained in each O-type. The high values of the homogeneities indicate
very tightly defined O-types.

The relative frequencies of the inclusive typologies include all 450
neighborhoods. If the four communities separately have the same pattern
of frequencies as the inclusive group, they have the same typology; this
is the situation in prewar vs. postwar communities, where the frequencies
are very nearly identical in the time groups. The index of similarity of
the frequency patterns for the 1940 data and the 1950 data is P² = .99.

The story is different with respect to locale. Because of the
inconsequential effect of time on the typology, we have combined the
frequency of neighborhoods on the common typology of San Francisco in 1940
and 1950 and have compared the San Francisco pattern with the East Bay,
similarly combined. These frequency patterns are in the columns headed
San Francisco vs. East Bay.

Specific differences in the typologies are quickly discerned from the
values in the Difference column. Positive differences refer to types that
have high frequencies in San Francisco but low frequencies in East Bay.
These are, in order of size, the downtowners, minorities, and the fancy
[...] clusters. Negative differences in the Difference column refer to
types of higher frequency in East Bay than San Francisco; these are
[...] and workers 1 cluster. An overall statement of the degree of
similarity of San Francisco vs. East Bay is the index of similarity
P² = .70, a high value but considerably less than the degree of similarity
between the two time periods.


Similarity of empirically derived typologies


A drawback to comparing the social-area structure of different
metropolitan sites on a common typology is that it sets a common mold for
them to fit into and does not provide an opportunity for types unique to a
particular site to be discovered. It is necessary to perform an empirical
typological analysis of each metropolitan group separately and then to
project all the discovered types of all the groups into the same master
analysis by which one can compare the similarities and differences in the
subgroup structures. This procedure was followed in the social-area
problem. A single "abstract" neighborhood was made up for each discovered
type, and these were then projected into a EUCO analysis which terminates
in a SPAN diagram showing the relations of all types to each other in one
grand configuration. Because this configuration has too much in it for a
single presentation, the data are presented in two parts of Fig. 9.8. The
top SPAN diagram includes as points the 18 types found in San Francisco, 9
in 1940 (labeled 40) and 9 in 1950 (labeled 50). The bottom sphere displays
the configuration of the 25 types in the East Bay, 13 in 1940 and 12 in
1950, similarly labeled. We have also included in each configuration the 11
inclusive types as reference points (black dots), as well as reference
points by which to identify the three score axes on the spheres. We have
also connected, with lines, the types found in 1940 and 1950 that have the
greatest similarity and have encircled them in dotted lines.

The salient conclusions from this comparative analysis seem to be as
follows. Though the computer derived the 1940 typology quite independently
of the 1950, and though few persons lived in the same neighborhoods
in 1950 and 1940, nevertheless almost precisely the same typology was
recovered on these two occasions, and this despite the social disruptions
of the war. The “reality” of the 11 social areas that compose the common
typology seems to be confirmed, though quite obviously the types found
in San Francisco are heavily composed of “cliff dwellers” and those of
East Bay of “hut dwellers.” Nevertheless, there are some common
overlapping types that are equally differentiated in the separate computer
runs on San Francisco and East Bay. The total configuration consists of a
swarm of points representing the 450 neighborhoods, between which there are
doubtless no clean dividing lines. Still, there seem to be 11 “natural”
areas of concentration which our typological method objectively discovers.
One could doubtless further divide the 11 types, or for that matter could
combine them into larger social areas. But such possibilities do not deny
the "existence" of the 11 areas that have been discovered. There are
obvious places in the configuration where no Bay Area neighborhoods existed
either in 1940 or 1950. Whether they are truly zones of disjunction,
signifying kinds of neighborhoods that cannot exist because such patterns
of FAS scores are genuine incompatibilities, is not clear from our
analysis. Perhaps they exist in Boston, Minneapolis, or New Orleans or will
appear in the Bay Area in later decades. Only further comparative
social-area analyses will tell us.


Chapter 10 


PREDICTING INDIVIDUAL AND GROUP DIFFERENCES 
IN CLUSTER ANALYSIS 


This chapter deals with the problem of predicting differences between
individuals in mental abilities (the Holzinger problem) and in
self-conception (the MMPI) and with the prediction of differences among
groups, namely, the neighborhoods of the San Francisco Bay Area.

When observing differences among objects in one domain of behavior,
how can the observations best be organized to achieve maximal prediction
of differences among them in other attributes? For example, in the
Holzinger problem, where 301 school children were observed on 24 tests
measuring verbal abilities, mental speed, form or space perception, and
memory abilities, how can these data be cluster-analyzed in order to
predict differences optimally, say, in the mathematical abilities of these
children?

It is shown in this chapter that the highest level of prediction is
multivariate and that the best multivariate prediction is achieved when the
predictor is a series of object-clusters. In the Holzinger problem, for
example, the children are cast into person-clusters, so selected that in
each such object-cluster the children are homogeneous in their pattern of
scores on the predictor abilities. Prediction from object-clusters is
called “differential prediction.” For one typological group of children,
having a given pattern of scores on the predictor abilities, we can predict
mathematical capability with a high level of accuracy, whereas for children
with another pattern of predictor abilities we can predict mathematical
ability no better than chance.


Three developments have made differential typological prediction 
possible: (1) an objective means of determining different types that 
compose a group, (2) a means of objectively describing the degree of 
predictability of other attributes by each type, and (3) the capability of 
assessing the probability that predictions from each O-type, however small 
the number of persons that compose it, can arise by chance. These 
developments are now embodied in components of the BC TRY computer system. 
The component OTYPE quickly and efficiently discovers the typological 
breakdown within a given group. The objective facts on the degree of 
predictability in each of the types are computed in the component OSTAT. The 
component 4CAST calculates the probability that the predictions by each 
O-type are those of a mere Monte Carlo (equal probability) sampling of 
subjects from the total supply. 


Individual prediction from four “basic” mental abilities 


In this first problem, multivariate prediction is of mathematical abilities 
in children from four cluster-defined "basic" mental abilities, V (verbal), 
S (speed), F (form or space), and M (memory). The subjects are the 301 
seventh- and eighth-grade children who were observed in their responses 
to 24 ability tests of the Holzinger problem. V-analysis and O-analysis of 
these data have been described extensively in previous chapters. 


The predicted mathematical abilities 


Among the 24 tests were 5 different tests of mathematical ability, none of 
which was used in defining the four basic abilities, V, S, F, and M. These 
five ... different sample measures ... (e.g., Holzinger and Swineford, 1939, 
p. 8). The tests were designed to measure general mathematical or deductive 
reasoning rather than specific arithmetic computational operations. The 
structural organization of the 24 abilities is shown pictorially in Figs. 7.3 
and 7.4, which give the graphic position of the mathematical abilities in 
relation to the predictor tests. 


TABLE 10.1 MULTIVARIATE PREDICTION OF FIVE MATHEMATICAL 
ABILITIES FROM V, S, F, AND M IN THE HOLZINGER PROBLEM 

                              Multiple Correlation with V, S, F, M 
Mathematical                                Domain (Factor) Scores 
Ability        Reliability    Raw Scores    (Theoretical) 
N20 Ded        .69            .59           .63 
N21 Puz        .80            .60           .65 
N22 Rsn        .73            .57           .60 
N23 Ser        .92            .63           .68 
N24 Ari        .81            .64           .69 


Multiple linear prediction of 
the mathematical abilities 


The multiple correlations between the scores of the children on each of the 
five mathematical ability tests and the four predictor clusters, V, S, F, 
and M, are listed in Table 10.1, headed Raw Scores. These multiple 
correlations cannot, in general, exceed the reliability coefficients of the 
tests, listed under Reliability. Predicting the mathematical abilities from 
"domain" or "factor" scores on V, S, F, and M produces the estimates of 
multiple correlation given in the last column. The obtained multiple 
correlations are not sensibly lower than those from theoretical factors 
measuring V, S, F, and M. 

A multivariate prediction of each mathematical ability from the four 
basic predictor abilities, V, F, S, and M, yields a multiple correlation of the 
order .65. This value is rather low for prediction, not providing much 
support for the view that a person's general intelligence, if measured by 
ability in mathematics, can be estimated from his standings in verbal, 
speed, spatial, and memory abilities. But multiple correlation, like all 
correlations, is an average statistic. While average prediction may be poor, 
for some types of persons it is much better than for others. 
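
To make the notion of a multiple correlation concrete, here is a minimal Python sketch that fits a least-squares linear prediction of one criterion from four predictors and reports R as the correlation between observed and predicted scores. The data are synthetic, and the function names (`pearson`, `solve_linear_system`, `multiple_correlation`) are illustrative assumptions; this is not the BC TRY computation, only the standard definition of R.

```python
import random
import statistics


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multiple_correlation(predictors, criterion):
    """R: correlation between the criterion and its best linear prediction."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1.0 = intercept
    k = len(rows[0])
    # Normal equations (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, criterion)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return pearson(criterion, predicted)


# Synthetic stand-in for 301 children scored on V, S, F, M (hypothetical data).
rng = random.Random(0)
predictors = [[rng.gauss(50, 10) for _ in range(4)] for _ in range(301)]
criterion = [0.3 * v + 0.2 * s + 0.2 * f + 0.1 * m + rng.gauss(0, 6)
             for v, s, f, m in predictors]

r = multiple_correlation(predictors, criterion)
print(round(r, 2))
```

A single R summarizes the whole group; it says nothing about which subgroups are predicted well or badly, which is the gap that O-type prediction addresses.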
Differential prediction of mathematical abilities from 
person-clusters (O-types) based on patterns of V, S, 
F, and M scores 


Figure 10.1 shows the general results of differential typological prediction, 
details of which are presented later. In the top section headed Deduction 
is the distribution of the mathematical deduction scores of two subgroups 
of children out of the total of 301, each child being represented as a dot. 
The top histogram is that of the O-type called "T1," consisting of 13 children 
that have a particularly distinctive pattern of standard scores on the four 
predictor variables, V, S, F, and M. Next, below, is the histogram of 22 
children of O-type T14, that has a different pattern of scores on V, S, F, 
and M. The common standard score scale at the top is the standard score 
transformation of the raw test scores to a mean of 50 and a standard 
deviation of 10. The dashed lines down the figure give the standard scores 
lying at 0, ±1, and ±2 standard deviations from the mean. 

FIGURE 10.1 
Distribution of standard scores on mathematical abilities of members of those 
Holzinger V,S,F,M O-types whose predictions are significant beyond the .001 level. 

At the far right of each histogram is given the mean standard score 
of the children of the O-type; for T1 it is 42 and for T14 it is 57. Thus, these 
two types differ in their mean deduction scores by 1½ standard deviations. 
The degree of homogeneity of the deduction scores of each O-type is 
reflected by the value H, listed at the far right, .65 and .55, respectively, 
for the two groups. 
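
The standard-score scale can be illustrated with a short Python sketch. It assumes a plain linear rescaling of raw scores to a mean of 50 and a standard deviation of 10, using the population standard deviation of the full supply; the text does not say which variance formula was used, and the raw scores below are hypothetical.

```python
import statistics


def to_standard_scores(raw_scores, mean=50.0, sd=10.0):
    """Linearly rescale raw scores to the given mean and standard deviation."""
    raw_mean = statistics.fmean(raw_scores)
    raw_sd = statistics.pstdev(raw_scores)  # population SD (an assumption)
    return [mean + sd * (x - raw_mean) / raw_sd for x in raw_scores]


def difference_in_sd_units(mean_a, mean_b, sd=10.0):
    """Express the gap between two standard-score means in SD units."""
    return (mean_b - mean_a) / sd


raw = [12, 15, 9, 20, 17, 11, 14, 18]  # hypothetical raw test scores
standard = to_standard_scores(raw)

# Means quoted in the text for O-types T1 and T14 on the deduction test.
gap = difference_in_sd_units(42, 57)
print(gap)
```

On a scale whose standard deviation is 10, the quoted means of 42 and 57 lie 15 points, that is 1.5 standard deviations, apart.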

Further down Fig. 10.1 are found the mathematics scores of different 
O-types on puzzles, reasoning, series, and arithmetic. The three O-types 
in the bottom sector of the figure are those for which the prediction of 
arithmetic scores is the highest. O-type T2 has arithmetic scores that do 
not overlap in any degree on those of the two types, T13 and T15. All three 
groups are highly homogeneous, having homogeneity coefficients of the 
order .90. 

The predictions displayed in Fig. 10.1 are of 15 types out of a possible 
75 such O-type predictions. These particular 15 are those which, despite 
the small number of cases in each type, are nonchance, at the significance 
level lower than .001. That is, in each case the probability of recovering 
each of the 15 histograms in Fig. 10.1 by random sampling from the full 
supply of 301 scores is considerably less than .001. As is shown later, more 
predictions than the 15 shown in Fig. 10.1 would have been included if 
the level of significance had been changed to .01 and considerably more 
if .05 had been chosen. 


The predictor typology 


The first step in the analysis just outlined is to cast the children into 
homogeneous O-types of their patterns of scores on the four predictor 
variables V, S, F, and M. The method of doing so has been fully described 
in previous chapters, where it was shown that these 301 children fall into 
15 O-types. These 15 O-types, with a summary description of the distinctive 
scores on V, S, F, and M, are given in Table 10.2 in the Predictor Abilities 
columns. For example, T1 consists of those 13 children whose scores are 
low on V, low on S, low on F, and low on M. The italicized terms low mean 
that the O-type is 1 standard deviation or more below the mean in standard 
scores; no italicization means that the score is from ½ to 1 standard 
deviation below the mean. The type contrasting most to O-type T1 is at 
the bottom of the table, labeled T15; the scores of the 14 children of T15 
are all high on V, S, F, and M, and greater than 1 standard deviation above 
the mean on V and S. The distinctive patterns of scores of the other types 
are shown in the first columns. When there is no entry, it means that the 
individual's score on a particular ability is indistinguishable from a mean 
of 50, that is, within ±½ standard deviation of the mean. 
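
The labeling rule for the descriptive terms can be sketched in Python as follows. The cutoffs of ½ and 1 standard deviation are an assumption read from this passage, the handling of scores falling exactly on a cutoff is arbitrary, and the function name `describe_level` is hypothetical.

```python
def describe_level(mean_standard_score, center=50.0, sd=10.0):
    """Map an O-type's mean standard score to a descriptive term.

    Returns (term, emphasized). emphasized=True stands in for the italic
    type used for deviations of 1 SD or more; an empty term stands in for
    a blank table entry (assumed here: less than 1/2 SD from the mean).
    """
    deviation = (mean_standard_score - center) / sd
    if abs(deviation) < 0.5:
        return ("", False)
    term = "high" if deviation > 0 else "low"
    return (term, abs(deviation) >= 1.0)


for score in (38, 42, 52, 57, 61):
    print(score, describe_level(score))
```

Under these assumed cutoffs, a mean of 42 is labeled low without emphasis, while a mean of 38 is labeled low with emphasis.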

The 15 O-types are the same as those given in the previous chapter, 
except that in the use of the computer component OTYPE the iteration 
procedure of the program executed one more trial for the typology of 
Table 10.2. The frequencies of cases in the 15 O-types, listed in the column 
headed Frequency, are about the same, and so are the average homogeneity 
coefficients on the predictor variables, in the column headed H̄. 
In addition to the 282 children in these 15 types, there were 19 "unique" 
children, excluded from O-types because their patterns of scores on V, 
S, F, and M did not fit satisfactorily the profiles of the 15. 


Predicted mathematical abilities 
of the 15 V, S, F, and M O-types 


The pattern of mathematical abilities of the O-types can be seen in the 
last five columns to the right under the general heading Significant. O-type 
T1, for example, is in the low class on four of the five mathematical abilities, 
earning low mean standard scores in four of them. The descriptive terms 
of both predictor and predicted abilities of this group of 13 children reveal 
it to be rather generally low. The numbers in Table 10.2 under the columns 
titled Level and Homogeneity show in detail, for each of the five 
mathematical abilities, the mean standard scores Z and their associated 
homogeneities H. It is from these values that the descriptive terms of the 
column Significant originate (see the notes to Table 10.2). 

Space limitations do not permit discussion of the mathematical abilities 
of each of the 15 predictor O-types, but specialists in human abilities 
may find a detailed study of the results in Table 10.2 of interest. One overall 
conclusion is apparent, namely, that the number of predictor basic abilities 
in which a given O-type is extreme is associated with the number of 
mathematical abilities in which it is extreme: both O-types T1 and T15 are 
extreme in the four predictor abilities and in four mathematical abilities; 
O-type T5 is extreme in three predictors and three mathematical abilities; 
O-types extreme in two predictors generally tend to be extreme in about two 
mathematical abilities; and in the large across-the-board average group, T9, 
the 38 children are average in all abilities. 


Level of confidence in predicted abilities of O-types 


How confident can we be that the predicted level and homogeneity of 
the O-types are significant values, i.e., could not arise by chance? For 
example, with respect to the top O-type in Table 10.2, T1 is a selection of 
13 children whose mean score in the first mathematics test, deduction, is 42, 
almost 1 standard deviation below the mean of 50 in the full supply of 301 
children. What is the probability of recovering such a mean value in a 
sample of 13 scores if we randomly drew 13 from the full supply of scores 
on the 301 children? The component 4CAST in the BC TRY System computes 
such a probability by drawing a large number of samples, one at a time, 
from the full supply. From these samples it computes the relative frequency 
(probability) of earning mean scores as low as, or lower than, the observed 
score. When 4CAST was actually applied to the full supply of 301 deduction 
scores and drew a very large number of samples of 13 from it, computing 
the mean deduction score of each sample, the program found from the 
full array of means that the number of samples having a mean of 42 or less 
was less than 1 in 1,000, that is, the observed mean is significant at the 
.001 level. Actually, 3,000 such drawings were made (see below), so that 
we know this result is quite firm. 
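
The sampling procedure described above can be sketched in a few lines of Python. This is an illustration of the general Monte Carlo idea, not the 4CAST program itself: the supply of 301 scores is synthetic, drawing without replacement is an assumption (the text does not say how samples are drawn), and the function name is hypothetical.

```python
import random
import statistics


def chance_probability_of_low_mean(supply, sample_size, observed_mean,
                                   draws=3000, seed=0):
    """Estimate P(sample mean <= observed_mean) under random sampling.

    Each draw takes `sample_size` scores without replacement from `supply`
    and records whether the sample mean is as low as the observed mean.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        sample = rng.sample(supply, sample_size)
        if statistics.fmean(sample) <= observed_mean:
            hits += 1
    return hits / draws


# Hypothetical full supply: 301 standard scores, mean about 50, SD about 10.
data_rng = random.Random(1)
supply = [data_rng.gauss(50, 10) for _ in range(301)]

# An O-type of 13 members with an observed mean of 42, as in the T1 example.
p_low = chance_probability_of_low_mean(supply, sample_size=13, observed_mean=42)
print(p_low)
```

For an O-type whose mean lies above 50, the same count would be taken with `>=` in place of `<=`. Because no distributional formula is involved, the estimate depends only on the scores actually in the supply and on the number of draws.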

In Table 10.2 each mean significant at the .001 level is shown by underlining 
its value; means significant at the .01 level carry a superscript a. In 
the Significant columns at far right, we have entered the descriptive term 
high or low only for those cases of O-types whose mean mathematical 
ability is significant either at the .01 or the .001 level. 

Which of the O-types have significant, distinctive mathematical abilities? 
The answer, of course, depends upon the level of confidence one 
accepts as significant. In Table 10.3 we have listed the number of O-types 
shown in Table 10.2 as significant at the .001 level, at the .01 (but not .001) 
level, and at the .05 (but not .01) level. If we accept .05 in the bottom sector 
as significant, then the number of O-types that have distinctive 
mathematical abilities at least at this level are shown in the bottom row headed 
Total; 38 of the total number of 75 predictions are significant. 

This result means, of course, that for about half of the O-types no 
predictions can be confidently made about their mathematical abilities. 
There is something about the nature of the mathematical abilities of such 
types that appears to be unrelated to their standing in the four basic 
abilities. The very best and strongest cases for prediction are those listed 
in the top sector of Table 10.3, where there seems to be no question that 
the mathematical capabilities of the type are very distinctive and 
predictable. These are the types whose mathematical achievements are clearly 
distinctive at the .001 level and whose histograms in the five abilities are 
graphed in Fig. 10.1. That figure therefore presents the best case for 
prediction in the Holzinger problem. 


To this point we have emphasized only the cases where distinctive 
mathematical abilities are predictable, i.e., those types whose mean 
mathematical abilities deviate significantly from the mean of 50. Prediction, 
however, is not only a matter of finding instances of those who deviate from 
the average. It is as important to know that individuals are at the average 
as it is to know that they deviate from it. The relevant types are those whose 
mean value is not significantly different from the average but whose 
homogeneity at the average is extremely high. These would be O-types in 
Table 10.2 with nonsignificant means but with significant H values. There are 
no such O-types in the Holzinger problem, but we shall later find many 
instances of this type of prediction both in the MMPI and in the social-area 
problem. The ambiguous cases are those without descriptive terms under 
Significant in Table 10.2. For example, type T1 shows no term under 
Arithmetic, signifying that the mean arithmetic score of these 13 children does 
not deviate significantly from 50 or have a homogeneity coefficient 
significantly different from .00. This result means that we cannot confidently 
say anything about these children with respect to their arithmetic ability, 
either that they are below average or at the average in arithmetic ability. In 
this Holzinger problem there are many blank spaces in the far right columns, 
a fact which supports the contention that, generally speaking, predictability 
of mathematical capabilities from the four basic abilities is rather poor. 


TABLE 10.3 BY CONFIDENCE LEVEL, THE O-TYPES 
THAT PREDICT MATHEMATICAL ABILITIES IN THE 
HOLZINGER PROBLEM 


Confidence 
Level Mathematical Abilities 
Ded | Puz | Rsn | Ser | Ari 
At p < .001 
T1 T4 T3 T5 T13 
T14 T5 T13 ті T15 
T15 T5. T14 
T15 
At p < .01 
T8 ТЫ. T1 T1 T2 
T10 T2 T3 
T4 TS 
T14 T14 
At p < .05 
T2 T2 T8 T2 T1 
T T14 nu T4 
T15 жу 
T9 = 
Total 7 6 9 9 7 
Σ = 38 (51%) 


Determination of confidence level 


Determining the level of confidence in O-type prediction is so critically 
important that the procedure should be described. The usual means of 
estimating significance in prediction problems is to resort to formulas that 
estimate “population values" in the predicted scores. These estimates 
usually require a belief that sample values are distributed normally. If 
nonparametric estimations are made, the error of prediction is likely to 
be so large that for the small numbers of cases found in many O-types it 
would be difficult to establish significance. Thanks to the computer, we 
are no longer hemmed in by such crude estimation tests. The BC TRY 
component 4CAST actually draws random samples from the full supply and 
shows quite clearly, unbound by any assumptions about normality, the 
probability of recovering either mean values or homogeneities above and 
below those actually observed in the particular O-type. Illustrations of how 
4CAST works are given in Fig. 10.2. In the top figure is the distribution of 
the means of 3,000 samplings of 20 cases each from the full supply of 301 
deduction scores. This computation was made to discover the chance 
probability of recovering for O-type T7 a mean value equal to or less than 
its observed mean of 45. The frequency histogram of these 3,000 means 
is printed in the top section of the figure. The abscissal scale is the 
standard score of each mean from the mean of all means. Below the standard 
score scale is the raw score scale of the means. Thus, the distribution is a 
pictorialization of the "standard deviation of the mean." It is a real 
distribution, not a fictitious one. The standard deviation of the distribution 
of means is, of course, the deviation of a mean at 1 standard deviation 
from the mean of all means, having in this example the value of 


52.12 - 49.94 = 2.18 

The important matter here is where the observed mean of O-type T7 
is located in this distribution. Its value, 45.42, is plotted by the program 
as an M encircled just above the Mean Score Scale at the left. 4CAST thus 
... less than .01. Therefore, type T7 is placed in Table 10.3 in the bottom 
category of .05 under Deduction. 
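
As a rough check on the reported spread of 2.18, the textbook standard error of a mean can be worked out for n = 20 scores drawn without replacement from a finite supply of N = 301. This assumes the supply has a standard deviation of exactly 10 (true of standard scores by construction) and that sampling is without replacement, neither of which is stated for 4CAST:

```latex
\sigma_{\bar{x}}
  = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
  = \frac{10}{\sqrt{20}} \sqrt{\frac{281}{300}}
  \approx 2.236 \times 0.968
  \approx 2.16
\qquad
\frac{45.42 - 49.94}{2.18} \approx -2.07
```

The formula value of about 2.16 agrees closely with the empirical 2.18, and on the empirical scale the observed T7 mean lies about 2.07 standard deviations below the mean of all sample means.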


The bottom graph in Fig. 10.2 shows pictorially the sampling means 
for T14, that is, the probability by random sampling of recovering a mean 
deduction score greater than T14's observed mean of 55.63. It can be seen 
that the observed mean, the encircled M at the far right on the mean score 
scale, lies in the category 2.5 standard deviations or greater. The tabled 
value in 4CAST shows that this observed mean is way off the scale of 
sample means. There are no values among the 3,000 means that are higher 
than 55.63. We therefore know that the probability of T14 earning its 
observed mean by random sampling is p < .001. For this reason, type T14 
is placed in Table 10.3 in the top confidence level of p < .001. 

One of the important properties of 4CAST is that no assumptions are 
made about the shape of the distribution or about the existence of an 
imaginary "infinitely large population" from which the O-types are drawn. 
Such assumptions are, in fact, sometimes unrealistic in biological and 
social data. 


Comparison of univariate, multivariate, 
and O-type prediction 


We are now in a position to compare the relative effectiveness of univariate, 
multivariate, and O-type predictions in the Holzinger problem. The 
comparisons can be made for each of the five predicted mathematical abilities, 
as shown in Table 10.4. In rows 1 and 2 is the best prediction in the 
univariate case. For example, under Deduction the highest correlation of the 
scores of the children on the deduction test is with the scores on the 
vocabulary test V9. The correlation with this best individual variable is .38. 
The next highest level of prediction is from single cluster-defined 
composites, either V or S or F or M. In the case of the deduction test, the 
highest correlation is .50 with the verbal composite, V. The next highest is 
linear multivariate prediction from all four composites; the multiple 
correlation of deduction scores with predicted scores of V, S, F, and M is .59. 


TABLE 10.4 INCREASE OF PREDICTION IN PROGRESSING FROM UNIVARIATE 
TO MULTIVARIATE TO O-TYPE PREDICTION IN THE HOLZINGER PROBLEM 

                                  Ded       Puz       Rsn       Ser       Ari 
Univariate prediction: 
 1. Best prediction from any      .38       .37       .51       .48       .44 
    of the 19 individual          from V9   from M16  from V9   from F1   from V9 
    variables of V, S, F, and M 
 2. Best prediction from a        .50       .49       .53       .54       .50 
    single composite,             from V    from S    from V    from F    from S 
    V, S, F, or M 
Multivariate prediction: 
 3. Linear multiple prediction    .59       .60       .57       .63       .64 
    from all four composites, 
    V, S, F, and M 
O-type prediction: 
 4. Best prediction from          .81       .78       .96       .83       ... 
    O-types: the three            .78       ...       ...       .80       .87 
    highest H values of           .72       .68       ...       .73       .85 
    O-types 

But multiple correlation does not represent the highest possible 
prediction. For some of the O-types the prediction is a great deal better 
than that signified by a multiple correlation of .59. In the bottom part of 
Table 10.4 are listed H values for the three O-types that have the highest 
H values given in Table 10.2. For the deduction test the three highest are of 
O-types T2, T7, and T8, these being .81, .78, and .72, respectively. Six of the 
15 O-types have H values greater than the multiple correlation value of .59. 
(We show below that the H values are directly comparable to the values of 
correlation coefficients.) 

The fact that about half of the O-types have better and half poorer 
predictions than that indicated by the multiple correlation is not a matter 
of chance. The reason is that the value of the multiple correlation is an 
average of the H values, and, therefore, about half of the O-types will do a 
better and half a poorer job of prediction than that revealed by correlation. 

Here is the relationship between correlation and H. First, express the 
correlation between any two variables by the more general form, namely, 
the curvilinear correlation or correlation ratio η: 


$$\eta^2 = \sum_{i=1}^{m} p_i H_i^2$$ 

Thus, the power of O-type prediction inheres in the ability to separate 
the types of individuals for which the prediction of a criterion is best from 
the types for which the prediction is worst. Handling the problem of 
prediction only in terms of multiple correlation and regression deprives us 
of the information on which that correlation is based. O-type prediction 
tells where to look in the group for the good and the poor predictors. 


Prediction from individual differences in four MMPI item-clusters 


We turn now to the use of cluster analysis in the prediction of individual 
differences from scores on such a "personality test" as the MMPI. 

The question in this section is this: From certain "basic" scores on 
the MMPI, how well can one predict other attributes of the individuals? 
The predictor variables and the attributes predicted from them have been 
described in a previous chapter and will be summarized here only briefly 
in order to provide an adequate picture of the prediction design. The indi- 
viduals in this study were 310 subjects: 70 psychotics, 150 anxiety cases, and 
90 Armed Services officers matched with the psychiatric cases for age and 
education. Applying the BC TRY procedures called BIGNV analysis (see 
Chap. 11) which permit a key-cluster analysis in problems with up to 
5,000 variables, the 566 items of the MMPI are found to have all of their 
generality accounted for by seven dimensions. The first three of these 
dimensions account for a large percent of the communalities of the items. 
A fourth dimension is accepted in order to account for as much of the 
additional variation as possible. 

The scores of the subjects on these four dimensions, symbolized as 
I, B, S, and T, provide the four-dimensional cluster score space in which 
the typological structure of the 310 subjects is determined. There result 
14 types of individuals whose nature and structure have already been 
described. The prediction design of this analysis aims to reveal the degree 
to which scores on the four "basic" MMPI variables, I, B, S, and T, predict 
scores on the other three scaled attributes, D, R, and A, that were not 
involved directly in the four-dimensional solution. We finally compare 
univariate, multivariate, and O-type prediction in this problem. 


The predictor item-clusters I, B, S, and T 


The MMPI item-cluster scales that were the predictors in this study may 
be briefly summarized as follows: 


I, introversion: 26 items with a score reliability of .93 
B, body symptoms: 33 items with a score reliability of .92 
S, suspicion and mistrust: 25 items with a reliability of .85 
T, tension, worry, and fears: 36 items with a reliability of .92 


The meaning of each scale inferred from item content is sharply clear. 
There is no item overlap between any of the scales. The reliability coeffi- 
cients of total scores on the scales, being about .90, are much higher than 
those usually reported for other MMPI scales. 


The predicted item-clusters D, R, and A 


The three predicted item-cluster scales are as follows: 


D, depression and apathy: 28 items with score reliability of .94 
R, resentment and aggression: 21 items with score reliability of .87 
A, autism and disruptive thoughts: 23 items with score reliability of .86 


The meaning of these scales, inferred from item content, is also 
quite evident. None of the scales have item overlaps on the others or 
upon the predictor scales, and total score reliabilities are also about .90. 


Multiple linear prediction of D, R, and A 


The MMPI prediction problem can be restated: If we compute only the 
scores of the subjects on the four basic item-clusters, I, B, S, and T, can 
we predict the subject's scores on D, R, A from I, B, S, T, and thus avoid 
computing scores on D, R, A? 

The answer, in terms of traditional linear multiple correlation, is 
as follows: 

                                        Depression   Resentment   Autism 
Multiple correlation with I, B, S, T       .84          .75         .73 

These multiple correlations are rather high as multiple correlations 
go, but they are certainly not 1.00, signifying that some information about 
D, R, A is lost. And this ... 


Differential prediction of D, R, and A 
from person-clusters (O-types) 


Before going into details, we look at a pictorial representation of O-type 
prediction in this MMPI problem. Figure 10.3 shows the histograms of 
standard scores on the depression, resentment, and autism clusters of 
those MMPI O-types whose predictions are significant beyond the p < .001 
level. There is a very high level of prediction in this MMPI problem 
compared with that of the Holzinger intellectual abilities depicted earlier in 
Fig. 10.1. For example, types T1 and T2 are both more than 1 standard 
deviation below the mean in their depression scores and are visibly quite 
homogeneous, and metrically so, as indicated by their high homogeneity 
coefficients, .99 and .98. Plotted below these two types are five others, T7, 
T11, T12, T13, T14, with high depression scores, four of them having mean 
standard scores at least 1 standard deviation above the mean of 50. In short, 
the two lows vs. the five highs are separated from each other by 2 standard 
deviations. A similar structure is repeated both for the predicted resentment 
variable in the middle sector of Fig. 10.3 and for the autism variable in the 
bottom sector. 

FIGURE 10.3 
Distribution of standard scores on depression, resentment, and autism of 
members of those MMPI I,B,S,T O-types whose predictions are significant 
beyond the .001 level. 

Certain general conclusions can be drawn from these histograms: 
(1) It is obvious that prediction of low scores on D, R, A is much more 
accurate than of high. (2) For some of the O-types, the prediction of these 
three attributes is considerably higher than that revealed by linear multiple 
correlation. For example, in the depression group, five of the seven have 
H values greater than the multiple correlation of .84; in the resentment 
group seven of the eight have H values in excess of their multiple 
correlation of .75; and in the autism group five of the eight have H values that 
exceed the multiple correlation of .73. 


The predictor typology 


In Table 10.5 the 14 predictor MMPI O-types are listed in the columns to 
the left, each defined by the descriptive terms that indicate the degree to 
which it is distinctive relative to the average performance of individuals 
on the four predictor variables, I, B, S, and T. For example, O-type T1 
consists of 26 individuals; it is low in all four attributes. At the opposite 
extreme in the bottom row, T14 consists of 10 individuals that are high on 
the four predictor attributes. In general, types T1 to T5 are low in at least 
one of the four predictor attributes. Low scores on these four item-clusters 
may be thought to represent "positive mental health." These five types 
belong to this category. Type T6 has average scores across the board. But 
from T7 down to T14 are types that are high in one or more of the 
predictors, i.e., these eight types suffer from one or more "symptoms" of 
emotional disorder. 

In the predictor attributes (the column headed H̄ in Table 10.5), the 
H values are all of the order about .90 or higher; not a particularly 
surprising fact since, of course, they are selected for their homogeneity on 
these attributes. These findings on the typology of the MMPI are substantially 
what was presented in the previous chapter. The typology shown in Table 
10.5 is a little sharper than that given in the previous chapter, being based 
on one additional iteration for typological structure by the OTYPE 
computer component. 


At the right in Table 10.5 are the predicted D, R, A attributes of the 14 
O-types. Comparing the descriptive terms in the last three columns with 
the words in the left column describing the status of each O-type in its 
predictor attributes gives a simple, clear picture of the predictor-predicted 
pattern of each O-type. For example, type T1, being low, or "healthy," in 
all four predictor attributes, is low, or "healthy," in the three predicted 
attributes of depression, resentment, and autism. The reverse situation 
is the case with the "sick" group, T14, at the bottom of the table. 

The value of this chart in the clinical use of the MMPI should not be 
underestimated. Once an individual has been located in his typological 
group by the methods described earlier, a "total" picture of the pattern 
of his attributes on the MMPI can be achieved by observing the status of 
his O-type both in the predictor attributes to the left of the table and the 
predicted attributes to the right. For example, if a given patient falls into 
type T11, one can say that he is distinctively highly introverted, with a good 
probability of being highly depressed, though in the other particulars of 
resentment and autism he is liable to have average status. 

As stated earlier, it is as important to know that an individual is 
average in a predicted attribute as it is to know where he distinctively differs 
from the average. Take the case of type T9, for example. The 16 individuals 
of this type are preoccupied with body symptoms, confessing to a large 
number of disturbed body functions. But their predicted status in D, R, 
and A is quite average in these three particulars, as shown by the values 
under D, R, A in the columns headed Level and Homogeneity. They are 
at the average in D and R because their homogeneities are of the order 
.82 to .85, significant at p < .001. With respect to the autism dimension, 
however, nothing can be said about the status of these individuals, for their
autism score is not significant. Ten of the predictions are at an average
status in the predicted attributes, as indicated by the abbreviation Av. That 
is, in one-fourth of the O-type predictions the individuals are average. 

Finally, the degree of predictability of the O-types in comparison to 
linear multiple correlation is reported in the columns of H values under 
Predicted Attributes. For example, in predicting the depression scores of 
individuals in the 14 O-types, comparing their H values with the multiple 
correlation between depression and the four predictor variables, shown 
below to be of the order .84, one finds that six of the O-types have homo- 
geneities above this value, some near 1.00. For them, depression scores can 
be predicted to a higher degree of accuracy than that indicated by a multiple
correlation of .84.


Comparison of univariate, multivariate,
and O-type predictions


The increase in predictability in going from univariate through multivariate
to O-type prediction is even more sharply evident in the MMPI problem
than in the Holzinger problem, as shown in Table 10.6. Take first the case
TABLE 10.6 INCREASE OF PREDICTION IN PROGRESSING FROM
UNIVARIATE TO MULTIVARIATE TO O-TYPE PREDICTION IN THE
MMPI PROBLEM

                                         D             R             A
Univariate prediction:
  1. Best prediction from any           .64           .59           .50
     of the best individual         from 431(a)   from 555(b)   from 431(a)
     items of I, B, S, or T
  2. Best prediction from a             .78           .69           .66
     single composite, I, B,          from T        from T        from T
     S, or T
Multivariate prediction:
  3. Linear multiple prediction         .84           .75           .73
     from four composites,
     I, B, S, T
O-type prediction:
  4. Best prediction from               .99           .97           .97
     O-types: the three                 .98           .93           .91
     highest H values of                .96           .88           .88
     O-types

(a) Worry over possible misfortunes.
(b) Feel about to go to pieces.


of predicting depression scores of individuals. Under univariate prediction 
one would certainly not make any serious effort to predict an individual's 
depression score from any of the individual items on these MMPI scales, 
for even the best item in all four scales correlates no higher than .64 with 
the depression scores. Prediction is improved a bit if one takes a composite 
of items, either I, B, S, or T. The T, or tension, score correlation with the 
depression score is .78, for example. Multiple prediction from all four
scores on I, B, S, and T is a little better; the multiple correlation is .84. But
more accurate prediction can be made for some O-types isolated by typological
techniques.


Conclusions on individual prediction 


What are the main conclusions from this systematic study of prediction 
in the two areas of intellectual ability and the MMPI type of self-report? 
First, it is evident that as one moves from the very specific responses of 
individuals in situations, such as to a particular item or stimulus pattern, 
through increasing composites of his performance across a variety of 
situations, then prediction to other “outside” attributes improves. This 
fact can be understood on domain sampling principles, because the more
extensive the compositing of the predictor behaviors or the more multi- 
variate the design of the battery of predictors, the greater the likelihood 
of sampling determinants of individual differences that would operate in 
other “outside” situations. This finding is of little encouragement to those 
psychologists who seek to find highly controlled specific experimental 
situations in which to get a "pure" measure of a given behavior from which
to make predictions. Behavior in highly controlled settings is likely to elicit 
only the most specific kinds of determinants that are unlikely to be elicited 
in other "outside" situations. 

The second major conclusion is that the predictability of an indi- 
vidual's behavior depends upon who the individual is. Domain sampling 
works out, as it were, differently in different individuals. For some indi- 
viduals, knowing their pattern of basic intellectual abilities may enable one 
to make accurate predictions of other abilities or cognitive aspects of their 
behavior. With other individuals, however, complete knowledge of their 
basic abilities may be of no earthly use in predicting how the individual 
will behave in other kinds of situations. Differential O-type prediction is a 
fact the behavioral scientist must live with. It is unlikely that he will dis- 
cover "laws" of prediction that will apply to everybody. 


Predicting group differences: the social-area problem 


This section differs from earlier sections in this chapter by focusing on 
objects that are groups of individuals rather than single individuals. The 
objects in this case are neighborhoods (census tracts) of individuals in a
metropolitan area consisting of the population of San Francisco and the 
East Bay communities. 

Knowing the prewar 1940 scores of a given neighborhood on its three 
basic demographic FAS dimensions, to what degree can one predict other 
characteristics of it, e.g., the same FAS characteristics observed a decade 
later in 1950 or such wholly different attributes as the subjective attitudes 
of the people in the neighborhood as revealed by their voting behavior? 
If the forecast is discovered to be accurate, it leads to a second question. 
May there not be a more basic set of dimensions of social structure than 
those defined by demographic features, one representing determinants 
of both the demographic and the attitudinal characteristics of the people, 
a set enduring over a long period of time? 

The general design of this analysis is as follows. (1) The FAS predictor
values of each of the 243 neighborhoods of the Bay Area are determined for
1940, before World War II. (2) Predictions are made from them to other
concurrent demographic and attitudinal features of the neighborhoods.
(3) Postwar characteristics observed up to 15 years later are predicted,
spanning a socially disrupting war and postwar period, at the end of which it
is doubtful that many individuals who originally lived in the neighborhood
still lived there. Do neighborhoods remain ecologically constant despite the
turnover of specific inhabitants?


The predictor prewar dimensions, F40, A40, S40


The 33 demographic characteristics of the 243 neighborhoods of the Bay 
Area have a cluster structure already described in previous chapters. In 
order to distinguish the 1940 variable clusters from the 1950 variable 
clusters we denote the respective clusters F40, A40, S40 for 1940 and F50,
A50, S50 for 1950.


Concurrent prediction of social achievement 
(Ach40) and vote for Roosevelt (P40) 


To discover whether the basic three FAS dimensions predict scores on 
other demographic characteristics concurrently observed in 1940, a subset 
of six variables not included as definers of the three basic FAS dimensions 
was selected as another independently observed demographic cluster. 
This achievement cluster, Table 10.7, Sec. A, labeled Ach40, is defined by 
six characteristics that generally stand for white-collar middle-class
achievement in American metropolitan society.

How well do the three dimensions predict an utterly different domain
of behavior, namely, the voting attitudes of individuals in these
neighborhoods? To find out, P40, the vote for Roosevelt, was computed from the
election tabulations. It seems clear that such a vote in a neighborhood in
1940 reflected the people's deep-rooted beliefs and biases about the role
of the government in their personal lives.


Forecasting postwar demographic 
and voting attitudes 


A socially disrupting great war uprooted the people from 1940 to 1945. The 
lives of most people were deeply affected. It was not uncommon for per- 
sons, even families, to move from one place to another. The physical 
neighborhoods as defined by the census did not change, but the bodies in 
TABLE 10.7 MULTIVARIATE PREDICTION OF
CONCURRENT AND POSTWAR DEMOGRAPHIC AND
VOTING ATTRIBUTES FROM PREWAR FAS SCORES
IN THE SOCIAL-AREA PROBLEM

A. Concurrent Prediction
(1940, N = 243 Tracts of Bay Area)

                                                  Multiple
                                                  Correlation
                                 Reliability      with FAS40
Ach40, achievement:                  .96              .92
    Wm
    Sc
    Ch
    Em
    Rn
    Re
P40, political 40:                   .88              .93
    Vote for Roosevelt, 1940

B. Postwar Prediction
(1947, 1950, 1954; N = 105 Tracts in San Francisco)

                                                  Multiple
                                 Reliability      Correlation
F50, family life(a)                  .94(b)           .98
A50, assimilation(a)                 .85(b)           .95
S50, socioeconomic(a)                .92(b)           .96
P47, political 47:                                    .90
    Vote for mayor, 1947
P54, political 54:                   .94              .85
    NdAg
    HsEm
    DmGv
    ExFI
T54, taxation 54:                    .88              .82
    HsBn
    AgBn
    ExBn
    VtBn
E54, ethnic-religious 54:            .87              .66
    WIEx
    ChEx
    CoEx

(a) For definers see Table 9.8.
(b) Reliabilities are for 225 tracts of the Bay Area.
(c) Reliability estimate is highest correlation with any other variable.
them frequently did. But did the changing people in these neighborhoods
also change in their demographic and attitudinal characteristics? We can 
find out by predicting scores of these neighborhoods on the FAS dimen- 
sions as observed a decade later in 1950 and on their postwar voting in 
elections. 

The important postwar predicted variables used in this study are 
shown in Table 10.7, Sec. B. The FAS scores in 1950 are based on the same 
defining variables used in 1940. In 1947 scores on the political dimension 
P47 were calculated from the vote for mayor in San Francisco. An analo- 
gous political dimension score, P54, was determined in the 1954 election 
and two other voting-attitude cluster scores were also secured, namely, 
T54, taxation, and E54, ethnic-religious. 

The 1954 voting-attitude cluster scores were calculated only for the 
neighborhoods of San Francisco. There were 31 state and city propositions 
and candidates on the ballot. The results of the analysis of these data 
were reported in Chap. 7. The voting-attitude clusters are listed in Table 
10.7, Sec. B. 

Multiple linear prediction of demographic
and attitudinal attributes


As in the prediction study of Holzinger mental abilities and of the MMPI 
item-clusters, the first prediction performed is by the standard linear 
multiple regression method. The results take the form of multiple correla- 
tions between the three predictor FAS scores in 1940 and the two concur- 
rent and the seven postwar attributes. These multiple correlations are 
given in Table 10.7, extreme right column, .92 for Ach40, and .93 for P40. 
These are high multiple correlations, meaning that if one knew only what a 
neighborhood's three FAS40 scores were, then by properly weighting those 
scores by regression coefficients of the predicted variables on the three 
predictors, one could forecast scores on the achievement variable and the 
vote for Roosevelt variable with great accuracy. 
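
The weighting procedure just described can be sketched numerically. The sketch below is only an illustration under stated assumptions: the data are synthetic, and the names `multiple_correlation`, `fas40`, and `ach40` are hypothetical stand-ins rather than anything taken from the BC TRY System. It fits least-squares regression weights for three predictor scores and reports the multiple correlation as the correlation between forecast and observed scores.

```python
import numpy as np

def multiple_correlation(predictors, target):
    """Return (R, weights) for a least-squares linear prediction of target."""
    n = predictors.shape[0]
    # A column of ones lets the regression include an intercept.
    design = np.column_stack([np.ones(n), predictors])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    forecast = design @ weights
    # The multiple correlation is the correlation of forecast with observed.
    r = np.corrcoef(forecast, target)[0, 1]
    return r, weights

# Synthetic stand-in for 243 tracts scored on three predictor dimensions
# (standard scores with mean 50 and standard deviation 10).
rng = np.random.default_rng(0)
fas40 = rng.normal(50.0, 10.0, size=(243, 3))
noise = rng.normal(0.0, 3.0, size=243)
ach40 = (50.0
         + 0.2 * (fas40[:, 0] - 50.0)
         + 0.3 * (fas40[:, 1] - 50.0)
         + 0.7 * (fas40[:, 2] - 50.0)
         + noise)

R, w = multiple_correlation(fas40, ach40)
# Univariate comparison: the best correlation with any single predictor.
best_single = max(abs(np.corrcoef(fas40[:, j], ach40)[0, 1]) for j in range(3))
print(f"best single-predictor |r| = {best_single:.2f}, multiple R = {R:.2f}")
```

Within the sample used to fit the weights, the multiple correlation can never fall below the best single-predictor correlation, which is the sense in which multivariate prediction improves on univariate prediction.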

Forecasts of postwar from prewar characteristics are also astonishingly
accurate. The results, in terms of multiple correlation, are in Sec. B of Table
10.7. How well can one forecast a neighborhood's postwar demographic
characteristics in 1950 from its prewar FAS values? Almost perfectly for
F50, family life, since the multiple correlation with F50 is .98! And the
multiple correlations with the other two demographic characteristics, A50 and
S50, are almost as high. But the voting attitudes of a neighborhood after
the war are also highly predictable from its prewar demography. The multiple
correlations between FAS40 and the political attitude vote in 1947 and in
1954, 14 years after the predictor scores were observed, are of the high order
of .90 and .85. There is a decrease in forecasting accuracy as one moves from
demographic to attitudinal features, especially of the ethnic-religious
attitude, whose multiple correlation with FAS40 is only .66.


Differential prediction from neighborhood 
O-types (social areas) 


Since multiple correlation is a statement of average prediction across all
individual objects, we now turn to the more powerful kind of prediction,
the differential prediction for different social-area types of neighborhoods.
The basic histograms are given in Fig. 10.4, which shows how three
social-area types are arrayed on the three predictor and seven predicted
variables. Each of the 39 neighborhoods of type T1 in the whole Bay Area has
scores on the three 1940 FAS clusters about the same as the others, as shown
in the three top histograms. The scale across the top is the standard score
scale into which scores on all of the variables have been converted. The mean
standard score of all census tracts in the supply is 50, and the standard
deviation is 10. For the 39 neighborhoods of type T1 the mean F40 score in the
very top histogram clusters around a value of 56, about 0.6 standard deviation
above the mean of the whole supply. In the second histogram the A40
scores center around 43, and in the third the S40 score clusters around 41.

It is not surprising that the neighborhoods of type T1 are so homogeneous
in the three predictor variables because they were selected that
way. But what is surprising is the discovery that in most of the nine
predicted variables on which the
Distribution of standard score
of four FAS40 O-types.


type T5 are the three histograms on the 1940 predictor variables of FAS40.
These 33 neighborhoods cling quite homogeneously around the middle 
standard score of 50. In the nine histograms of predicted variables, both 
concurrent and postwar, this average social-area O-type tends to have 
values on the predicted characteristics that are also near the average 
standard score of 50. This fact does not mean that the neighborhoods of 
type T5 behave like a random sample of neighborhoods drawn from the full 
supply. They behave homogeneously at the average, as revealed in the last 
column of numbers headed H in Fig. 10.4, showing the homogeneity
coefficients of the O-type on the respective variables. If all the neighborhoods
of a given type have exactly the same score, the H value is 1.00, but if the
scores behave like a random selection from the full supply, H is .00. The H 
values for the predicted variables of type T5 are of the order .90 or higher, 
signifying a high degree of prediction (or low error). Indeed, the homo- 
geneity of the neighborhoods of type T5 in the predicted variables is about 
as high as that of the extreme social area composed of combined types T10 
and T12 at the bottom of the figure. 
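
The two boundary cases of H described above can be illustrated with a short sketch. The formula for H is not given in this section, so the code assumes one plausible form, H = sqrt(1 - within-type variance / supply variance), floored at zero; this assumed index equals 1.00 when every member of a type has the same score and .00 when the type is as variable as the full supply. The function name `homogeneity` and the sample scores are hypothetical.

```python
import math

def homogeneity(type_scores, supply_sd=10.0):
    """Assumed index: H = sqrt(1 - var_within_type / var_supply), floored at 0.

    supply_sd defaults to 10, the standard deviation of the standard-score
    scale used for the full supply of neighborhoods.
    """
    n = len(type_scores)
    mean = sum(type_scores) / n
    var_within = sum((x - mean) ** 2 for x in type_scores) / n
    ratio = var_within / supply_sd ** 2
    return math.sqrt(max(0.0, 1.0 - ratio))

identical = [50.0] * 33                       # every neighborhood scores 50
tight = [48.0, 49.0, 50.0, 50.0, 51.0, 52.0]  # scores cling near the average
as_supply = [40.0, 60.0]                      # spread equal to the supply's

print(homogeneity(identical))   # 1.0
print(homogeneity(tight))
print(homogeneity(as_supply))   # 0.0
```

Under this assumed form, a type whose scores cling within a point or two of 50 has H above .99 even though its mean is merely average, which is the situation described for type T5.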


Predicted attributes of the 12 social areas 


The complete findings on predicted characteristics, both concurrent and 
postwar, are given in Table 10.8. Since this table is complex, consider first 
the top row, for type T1. Across the row in the 1940 concurrent predicted 
variables, T1's high family score is accompanied by high political, the vote
for Roosevelt, and its low assimilation and low socioeconomic is accompanied 
by low achievement. For a contrast, compare this pattern with the predictor- 
predicted pattern of T12 at the bottom opposite extreme. The across-the- 
board average social-area type is T5. When a given characteristic deviates 
from the mean by 1 standard deviation or more, its descriptive term is
italicized in the table, as in the case of T1's high under P40, vote for
Roosevelt. 

The main findings of Table 10.8 are (1) that for all 12 types the stand- 
ard score level of postwar attributes is generally predictable at a very high 
level of significance and (2) that the homogeneities, usually in the .90s, are 
nearly as high on the predicted as on the predictor attributes. 

Space limitations do not permit a detailed exploration of disjunctive 
patterns that do not occur, thus disclosing the operation of social and 
biological forces that prevent certain combinations of demographic and 
attitudinal characteristics from appearing in any neighborhood. For exam- 
ple, where are neighborhoods that are low on the assimilation variable and 
high on the socioeconomic? There are none. Social discrimination in our 
society apparently so intimidates members of minority groups that the 
well-off individuals among them who can "pass" seem to avoid forming
habitat groups that would identify them as a homesite body. Conversely, 
the opposite combination, high scores on the assimilation variable, and low 
on the socioeconomic variable, does occur in the conjunctive type T6. Social 
selection does not prevent the grouping together of poor whites who, when 
also characterized by high scores on the family life variable as is T6, become 
the respectable "poor but decent."
Constancy of social areas despite extensive turnover


The best prediction is of the same characteristic over time. There is virtually 
no change in each neighborhood's distinctive demographic characteristics 
from 1940 to 1950, despite the fact that the people in most neighborhoods 
underwent much war dislocation. Here are the facts, expressed as correla- 
tions between scores on the neighborhoods' three demographic charac- 
teristics observed in 1940 and the same ones 10 years later in 1950: 


Demography:

r_{(F40)(F50)} = .98
r_{(A40)(A50)} = .95
r_{(S40)(S50)} = .95

The constancy of voting attributes is nearly as great, despite the fact 
that the attitude objects voted on were not identical at the different times: 


Attitude:

r_{(P40)(P47)} = .98
r_{(P40)(P54)} = .84
r_{(P47)(P54)} = .93


Urban social areas are not only stable but appear to persist in
their biosocial characteristics unaffected by the particular persons who
inhabit them. A very substantial proportion of the people living in the
neighborhoods in 1940 moved out during the war decade and were replaced
by a new crop. This fact is revealed in the annual percentage turnover as
reported in the 1950 census. The tract statistics include, for each tract, the
percent of persons not in the same house in 1950, compared with 1949.
For the 105 neighborhoods of San Francisco the annual turnover was


Annual turnover, %    5-9    10-14    15-19    20-24    25-29    30-34
Number of tracts       1      21       39       29       10        5


The lowest turnover is about 10 percent, the highest about 33 percent. The
average tract has an annual turnover rate of about 20 percent. If this rate
did not sensibly change for each one of the 10 years from 1950 back to
1940, and if selection were random, a little simple arithmetic reveals that
for every 100 individuals in an average neighborhood in 1940, 80 would be
left in 1941, 64 in 1942, and so on up to 1950, at which time, when the census
man came to enumerate them, only 11 out of the original 100 would be
present. Even for those neighborhoods with the lowest rate of turnover
of around, say, 10 percent, after 10 years two out of every three persons
would have been replaced.
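
The "simple arithmetic" is compound attrition, and it can be checked with a few lines. This is a minimal sketch under the same two assumptions stated in the text, a constant annual rate and random selection of who leaves; the function name `remaining_after` is hypothetical.

```python
def remaining_after(years, annual_turnover, start=100):
    """Original residents still present after `years`, assuming a constant
    annual turnover rate applied at random to whoever is left."""
    return start * (1.0 - annual_turnover) ** years

# Average tract (20 percent a year) and lowest-turnover tract (10 percent).
for rate in (0.20, 0.10):
    left = remaining_after(10, rate)
    print(f"turnover {rate:.0%}: {left:.0f} of 100 remain after 10 years")
```

At 20 percent a year about 11 of the original 100 remain after a decade; at 10 percent about 35 remain, so roughly two out of every three have been replaced.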
Despite this copious turnover, the characteristics of the neighborhood
remained relatively unchanged. The demographic and attitudinal
characteristics of social areas are, as it were, supra-individual. They appear
to be invariant with respect to who moves in and moves out. To illustrate, the
attitudes and values of Skid Row seem to go on forever despite the fact
that during the course of a 10-year period most of the original inhabitants
will have died. Only an urban renewal scheme, which bulldozes it away, can
change it. An exclusive, conservative neighborhood seems to remain that
way whoever moves in or out, especially in a "contained" city with fixed
limits.

Comparison of univariate, multivariate,
and O-type prediction


We also find here, just as in individual prediction, that as one proceeds
from univariate through multivariate to O-type prediction, there is a
systematic increase in the predictability of the characteristics of
neighborhoods. The facts are shown in Table 10.9. The first row, devoted to
univariate prediction, lists the highest correlation of each predicted variable
with any one of the three single predictors, F40, A40, or S40. Thus, for P40,
vote for Roosevelt, its highest correlation is -.87 with the S40 socioeconomic
score. The second row gives linear multiple correlations with all three
predictors. Finally, the bottom rows of the table give the three highest H values
taken from Table 10.8. Here is the best prediction of all. Prediction for
neighborhoods of those types that have H values near unity is nearly
perfect. But for neighborhoods of other types that have lower H values it is,
of course, poorer. The power of differential prediction from knowledge
of the social areas to which neighborhoods belong is that it tells one where
prediction is nearly perfect. Multivariate prediction by multiple correlation
does not yield such a discriminating forecast.


TABLE 10.9 INCREASE OF PREDICTION IN PROGRESSING FROM UNIVARIATE TO
MULTIVARIATE TO O-TYPE PREDICTION IN THE SOCIAL-AREA PROBLEM

                             Concurrent                    Postwar
                            Ach40   P40    F50    A50    S50    P47    P54    T54    E54
Univariate prediction:
  1. Best prediction from           -.87   .98    .95    .95   -.81   -.81   -.76   -.53
     a single composite,            from   from   from   from   from   from   from   from
     F40, A40, S40                  S40    F40    A40    S40    S40    S40    F40    F40
Multivariate prediction:
  2. Linear multiple         .92    .93    .98    .95    .96    .90    .85    .82    .66
     correlation from
     F40, A40, S40
O-type prediction:
  3. Best prediction from O-types: three highest H values of O-types
        1.00
        .99  .97  .99  .99  .98  .99  1.00  .99
        .97  .97  .99  .99  .97  .98  .99  .92  .93
        .97  .92  .87
        .97  .95


Basic tridimensionality of social areas

Summarizing, we have discovered (1) that 33 demographic characteristics
of neighborhoods can be described without loss of generality by only three
cluster-defined composites drawn from them and (2) that 31 voting
attitudes of the neighborhoods also can be accounted for by only three
cluster-defined attitude composites. In differentiating the neighborhoods,
therefore, nothing general is lost if we ignore all the original 64 measures 
taken on them and in their place describe variation among them by scores 
on only the six clusters. (3) We have discovered that the three prewar 
demographic cluster scores predict attributes of neighborhoods from 8 to 
15 years later, despite a great turnover of people in the intervening time. 
Putting these findings all together suggests that there may exist not six 
but only three basic dimensions that differentiate the neighborhoods. 

The way to discover how many basic dimensions exist is to project the
scores on demographic and attitudinal clusters into one single crucial
inclusive key-cluster analysis. If the demographic three are different from
the attitudinal three, then a six-dimensional solution will result. Indeed,
perhaps at least eight may be required if time is an additional dimension,
for it is possible that all measures observed in 1940 may show special
correlation among themselves different from a special association among
those observed 10 years later. The crucial test consists of a key-cluster
analysis of the following 16 scores in the neighborhoods of San Francisco:
the three FAS measures observed in 1940 and 1950; the three political
attitude variables, P40, P47, P54; the other two voting dimensions, T54 and
E54; and five additional variables defined especially for this inclusive
analysis. The additional five were these:


Income
    I49  Income (per capita) 1949
Turnout
    V40  Turnout of registered voters 1940
    V47  Turnout of registered voters 1947
Movement of home (turnover of people in the neighborhood)
    M40  Moved home in 1940
    M50  Moved home in 1950


This factorial analysis of the 16 variables is called the DAT study,
since it is a simultaneous dimensional analysis of Demographic, Attitudinal,
and Time variables. When the matrix of correlations among the 16 DAT
variables is factored by the key-cluster method, a three-dimensional
solution completely exhausts the initial estimates of communalities of all 16
variables. To check this extraordinary finding a principal-axes solution on
the matrix with 1.00 in the diagonal was run. Three principal-axes dimensions
account for 94 percent of the total variance of all 16 variables. Crediting
the remaining 6 percent of the total variance to unique or trivial common
variance, we are compelled to accept this conclusion: Only three dimensions
account for nearly all the variance among neighborhoods in demographic
and attitudinal characteristics covering a span of 15 years!
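
The principal-axes check can be sketched as an eigenvalue computation. What follows is an illustration on synthetic data, not a reproduction of the DAT study: the scores are generated from three hypothetical latent dimensions plus noise, so the printed percentage will not match the 94 percent reported above. It shows only how the share of total variance carried by the first three dimensions is obtained from a correlation matrix with 1.00 in the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tracts, n_vars, n_dims = 105, 16, 3

# Hypothetical data: each variable mixes three latent dimensions plus noise.
latent = rng.normal(size=(n_tracts, n_dims))
loadings = rng.normal(size=(n_dims, n_vars))
scores = latent @ loadings + 0.3 * rng.normal(size=(n_tracts, n_vars))

# Correlation matrix among the 16 variables, with 1.00 in the diagonal.
corr = np.corrcoef(scores, rowvar=False)

# Eigenvalues of the symmetric matrix, sorted from largest to smallest.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Total variance equals the number of variables, so each eigenvalue's share
# is its value divided by that total.
explained = eigenvalues / eigenvalues.sum()
print(f"first three dimensions: {explained[:3].sum():.0%} of total variance")
```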
What is the nature of the basic three dimensions that determine
variation among neighborhoods? The answer seems clear. The structure
of the relationships among the 16 variables depicted in the spherical
configuration given in Fig. 10.5 gives the clue. Recall that in such a spherical
display, the points represent the 16 variables and that the spatial
separations between them are functions of their intercorrelations, as described
earlier. The configuration of the purely demographic characteristics at or
near the corners of the spherical triangle, the S's, F's, and A's, is exactly
the same as shown in Fig. 9.7, but now each of the three clusters also
includes the attitudinal characteristics.

We designate the three inclusive basic clusters as "conservatism,"
"territoriality," and "exclusiveness." Observe, first, the conservatism
cluster at the lower left: there is almost perfect collinearity between the
socioeconomic dimensions, S40 and S50, and the political dimensions,
P40, P47, and P54. The political dimensions have been "reflected," as
signified by the minus sign in front of them, meaning that they are
associated with the socioeconomic variables in a reflected state. For example,
a high P40 score, the vote for Roosevelt, of a neighborhood is associated
with a low socioeconomic score, S40 or S50. If we were to form an inclusive
composite score on each neighborhood consisting of these five variables,
one with a high score on it would therefore refer to a neighborhood in a
high socioeconomic independence, anti-Roosevelt condition, generally
recognized as that of conservatives.
In analogous fashion, the second dimension at the right, labeled
territoriality, embraces the family life characteristics, F40 and F50, coupled
FIGURE 10.5
Spherical configuration showing the relationships among the 16 demographic and
voting-attitude clusters of 105 San Francisco neighborhoods spanning 1940 to 1954.


with the anti-tax attitude, -T54. Associated with it are the four new indexes
of community involvement, voting turnout, V47 and V40, and nonmoving
or physical stability, -M40 and -M50. A high score on a composite of these


The third cluster, exclusiveness, is rather obviously centered on the 
two assimilation characteristics, A40 and A50, that are somewhat loosely 
associated with the anti-ethnic vote, -E54. This exclusiveness dimension
is the majority-minority dimension, a high score on which would characterize
a neighborhood consisting of native White Protestants of northwest
European origin holding negative attitudes toward minority groups. 

A final point of interest is the locus of money in the configuration, as 
measured by family income, I49. Note that I49, not surprisingly, is associated
both with the conservatism dimension and with the exclusiveness
dimension but that it is independent of the territoriality dimension. 
Clearly, the conservatism and territoriality dimensions are relatively 
independent. But the exclusiveness dimension is positively related to both. 
The purely demographic characteristics of exclusiveness, namely, A40 and 
A50 are, however, more highly associated with the conservatism dimension 
than with the territoriality dimension, but the attitudinal component, -E54,
which is tinctured with a reluctance to spend money on tax relief for
minorities, leans toward territoriality with its anti-tax component, -T54.
It seems quite likely that these three basic dimensions are determinants
of the social structure of most urban neighborhoods of modern
America. The correlations among the three may not be the same from
city to city, but the dimensions will probably be there. Other investigators
have observed one or more of the three demographic aspects of these
three basic dimensions. In the objective measure of Warner's social class
dimension, some of the definers of it are those that define our S dimension
(Warner, Meeker, and Eells, 1949). The intuitively derived urbanization,
segregation, and social rank variables of Shevky include some of the
same definers of our F, A, and S, respectively (Shevky and Bell, 1954).
Particularly relevant are the Borgatta-Hadden factors recently reported
by Cartwright and Howard (1966) and used by them in their study of urban
gangs. Their socioeconomic status factor I has similar definers to our
S cluster, their suburb and stable family factors II and III include definers
of our F dimension, and the disorganization-deprivation factor IV is
probably close to our A dimension.


The findings here clearly establish that scores on only the three
higher-order clusters, conservatism, territoriality, and exclusiveness, are
necessary to predict all the significant variations among the neighborhoods.
From a practical measurement viewpoint, however, it is not necessary to
include voting behavior when we wish to describe a given neighborhood's
characteristics with regard to territoriality, exclusiveness, and conservatism.
All we need are the easily obtained, objective demographic characteristics,
F, A, and S, for this analysis has shown that the voting attitude of these
neighborhoods is almost perfectly predictable from the demographic
features.
UNRESTRICTED CLUSTER AND FACTOR ANALYSIS


Before digital computers were commonly available to the multivariate
researcher, a major problem was the clerical difficulty of multivariate
analysis. Even in relatively small problems such as the 24-variable Holzinger
study, the amount of labor required to execute only the first step in the
analysis of individual differences was staggering. Many weeks of hard work
could be involved in factoring a correlation matrix. Now that digital com- 
puters are available, such studies can be executed in a matter of seconds 
of computer time. Computer programs, however, are generally written 
with a definite limit to the number of variables and objects that can be 
entered into an analysis. In the BC TRY System the limit on the number of 
variables is 120 on the IBM 7094 and 90 on the CDC 6400. Limitations on 
the number of subjects for V-analysis are not so stringent, 9,999 on the 7094 
and 5,000 on the CDC 6400. In O-analysis the number of objects that can 
be handled by EUCO analysis is the same as the number of variables, 120 
or 90. In OTYPE analysis this restriction is not present, and the limit on the 
number of objects is 9,999 or 5,000, depending on the program. One might
suspect from these statements that the computer imposes a very definite
limitation on the analysis of data. The limitation is not a logical restriction
but a practical one. When the number of variables is very large, special
techniques in computing must be used in order to handle the large quantities
of data involved. These special techniques are more difficult to write into a
computer program, and they take a longer time to execute for a given number
of variables. Generally, programs written for a smaller number of variables 
(like 120) cost less to write and to execute than programs written for a 
larger number of variables (like 500) even if they are both applied to the 
same data. A major cause of the difficulty is the necessity of assuming, in 
the large-scale program, that the memory of the computer will not hold 
all the data at one time. Consequently, more or less elaborate schemes are 
necessary to store and retrieve segments of the data as the need arises to 
operate on the data segments. Such programs, to store and recover data, 
are both difficult to write and costly to operate. As a consequence, we have 
addressed ourselves to the task of devising methods and procedures to 
utilize the BC TRY procedures of factor and cluster analysis on large-scale 
theoretically unrestricted data sets, all without changing the basic pro- 
grams. In order to fully implement these procedures several other pro- 
grams have been written, particularly to sample sets of variables and 
objects. 

The procedure for unrestricted cluster and factor analysis in the BC 
TRY System is called BIGNV analysis. When applied to successive samples 
of 120 or 90 variables, BIGNV converges on the salient cluster or factor 
structure of the full supply of variables. This convergent method permits
an inverse analysis of individuals (O-analysis, in which objects play the role
variables play in V-analysis), also unrestricted in principle by N, the number
of subjects. The expression "in principle" means that the theory and
logic of the procedures remove restrictions on n and N. In reality, however,
hardware constraints of the computer and the costs of running problems
with very large numbers of subjects and variables put practical limits on
the studies that can be done.

The domain sampling theory and steps of BIGNV are described and
illustrated in this chapter. One set of illustrations involves two independent
V-analyses of the 566 items that compose the MMPI. The other set involves
two O-analyses. The first is on the item responses of 310 adult subjects who
took the MMPI: 70 psychotic and 150 anxiety outpatients of a Veterans
Administration Clinic, matched for age and education with 90 Armed
Services officers. The second is on the original scores of the Grant White
School sample of 145 children of the Holzinger Study. The studies involved
have been described in some detail in previous chapters.


BIGNV in the MMPI study 


The formal question is as follows: How can one solve for the cluster or 
factor structure of a full supply of 566 items when the computer program
is designed for no more than 120 variables? The cluster or factor structure
in the unrestricted analysis should be that structure which would have been 
found if one did have a superprogram on a large computer that could 
handle the full supply directly. The solution to the problem, as imple- 
mented in BIGNV, is taken from the general ideas of sampling theory in 
statistics. The structure is estimated in each of a sequence of samples 
from the full supply, and the cluster structures of the samples are merged 
into a single structure. The full supply of variables is divided into manage- 
able random samples. The BC TRY procedure that selects these samples 
is called SAMPLER. The cluster structure of each of the samples is deter- 
mined by the existing BC TRY programs. The BC TRY procedure called 
MERGER is then used to combine the structures of the random samples
to form an estimate of the structure in the full supply. In short, SAMPLER
breaks down the supply into samples, and MERGER builds it up again.
In the process the cluster structure of the full supply is revealed.
It should be noted from this illustration that by using this domain sampling 
procedure we are in principle no longer constrained from tackling any 
problem no matter how large the full domain of variables from which the 
samples are to be drawn. 

The detailed procedures of applying BIGNV to the MMPI are not
spelled out here. Rather, the general outline of the four primary
steps in the analysis is discussed along with the results of the analysis.
The BC TRY User's Manual gives the details. Table 11.1 shows the main
stages of the analysis as applied in the 120-variable system on the IBM
7094. The implementation of these stages in an initial V-analysis of the
566 MMPI items is illustrated in the subordinate steps under the four
stages. In stage 1 the SAMPLER procedures break down the 566 items into
five random samples of items, each sample containing 120 items. The
communality of each variable in a sample is calculated within that sample. A
new sample is drawn, composed of the 120 items, out of all 566 items, that
have the highest communalities. This sample of items is submitted to a
standard key-cluster analysis. The cluster solution reveals nearly a dozen
cluster-defined dimensions. Some of the dimensions are highly correlated,
and others are defined by narrow, specific doublets. Combining the highly
correlated clusters and deleting the trivial doublets in the analysis produce
a hierarchical condensation that yields a salient set of three pivotal
dimensions. These dimensions are defined by three clusters of items readily
identified as the introversion (I), body symptoms (B), and suspicion (S)
clusters introduced in earlier chapters.

TABLE 11.1 BIGNV PROCEDURES OF BC TRY APPLIED TO V-ANALYSIS
(FIRST TRIAL ON 566 MMPI ITEMS)

Preliminary

   Input by the data processor of raw scores on the full supply of 566 items

Stage 1. Sampler: Breakdown of the Supply of n Items into Samples of 120;
Determination of the Pivotal V-dimensions from Them

   Selection of five random samples of items
   Calculation of communalities of all items in their samples
   Selection of the 120 most general items for pivot dimension analysis
   Discovery of pivotal cluster-defined dimensions on the most general
      samples, after hierarchical condensation: pivotal I, B, S

Stage 2. Preset Cluster Analyses on the Samples: on Item Samples of
Decreasing Generality (Size of Communality)

   Selection of five item samples of decreasing communality; rejection of
      items of trivial communality (200)
   Preset factoring and structure analysis of the four most general samples
      on the three pivotal dimensions
   Testing the sufficiency of the pivotal dimensions in the samples

Stage 3. Merger: Synthesis of the Cluster Structure of the Supply of n Items
from the Samples

   Rejection of additional specific items of trivial communality (117)
   Formation of a composite statistical structure and geometric
      configuration of the supply from the samples (249)

Stage 4. Description: Nature of the Cluster Structure of the Full Supply of
n Items

   Final decision on three pivotal and four dependent oblique item-clusters
      from the composites and from auxiliary sector analysis
   Rejection of rationally ambiguous items (57)
   An abridged representative structure of 118 of the "best" items of the
      seven oblique clusters
   Conceptualization of the seven MMPI item-domains: pivotal I, B, S and
      dependent D, R, A, T (retained items: 192 or 34%)


In stage 2 a concern is the further elimination of items having trivial
communality. Items with the highest communalities are the ones most
likely to form clusters (or to be saturated by factors). On the other hand,
an item that shares little or no variation with any other item cannot
possibly be a defining variable of any cluster or enter into a factored general
dimension. Accordingly, in stage 2 of Table 11.1, 200 items of trivial
communality are first deleted, including the 16 duplicate items of the MMPI.
Remaining items are then regrouped into samples of decreasing
communality. Defining items for the three pivotal dimensions are included in
each of the samples as markers. Cluster analyses are carried out on these
samples, with solutions preset on the common set of three pivotal
dimensions. The sufficiency of the three pivotal dimensions is determined by
inspecting the residual correlations.
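
One way to picture the sufficiency test is to subtract from each observed correlation the value reproduced by the preset dimensions and to look at what is left. The sketch below is a minimal illustration, not the BC TRY procedure: it assumes orthogonal dimensions, so that the reproduced correlation of items i and j is the sum of products of their factor coefficients, and the function names are hypothetical.

```python
def residual_correlations(r, loadings):
    """Off-diagonal residuals r[i][j] - sum_k a[i][k] * a[j][k] (orthogonal dimensions assumed)."""
    n = len(r)
    residuals = []
    for i in range(n):
        for j in range(i + 1, n):
            reproduced = sum(a * b for a, b in zip(loadings[i], loadings[j]))
            residuals.append(r[i][j] - reproduced)
    return residuals


def root_mean_square(values):
    """Single summary figure for the size of the residuals."""
    return (sum(v * v for v in values) / len(values)) ** 0.5
```

If the residuals are uniformly close to zero, the dimensions account for the correlations in the sample; a block of sizable residuals would suggest that a further dimension is needed.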

In stage 3, an additional search for items with trivial communality 
(now with respect to the common dimensions) is carried out. Another 117 
items are removed as a result of this further search. Since the item sam- 
ples are factored on common dimensions in stage 2, in stage 3 the MERGER 
procedures consist of simply forming a single composite cluster structure. 
A surviving pool of 249 items (all with salient communality) is involved in 
this procedure. 

With the composite structure of the full supply revealed, the stage 4 
phase consists of making a final decision on the salient clusters in the 
structure. Careful study leads to the selection of seven clusters: the three
pivotal dimension definers and four additional clusters, depression (D), 
resentment (R), autism (A), and tension (T). In this process 57 rationally 
ambiguous items are lost. As a final abridged sample, the 118 items that 
best define the seven clusters are chosen. A single dimensional analysis 
of these items is then carried out, thus yielding, in one computer run, a 
representative structure of the full supply of MMPI items. The seven item- 
clusters are listed in Table 11.2. 
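
The item counts quoted for the four stages can be checked with a few lines of arithmetic. The figures are those given in the text and in Table 11.1; only the variable names are ours.

```python
supply = 566  # full supply of MMPI items

# Rejections reported at each stage of the first trial.
rejections = [
    ("stage 2: trivial communality", 200),
    ("stage 3: trivial communality on the common dimensions", 117),
    ("stage 4: rationally ambiguous", 57),
]

remaining = supply
history = []
for reason, count in rejections:
    remaining -= count
    history.append((reason, remaining))

retained_percent = round(100 * remaining / supply)

for reason, left in history:
    print(f"after {reason}: {left} items")   # 366, then 249, then 192
print(f"retained: {remaining} items, about {retained_percent}% of the supply")  # 192, 34%
```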

The first three clusters, labeled I introversion, B body symptoms, and
S suspicion and mistrust, are the pivotal three that defined the
tridimensionality of the analysis. For each cluster the contents of the best 17 items
only, paraphrased for simplicity, are shown. They are best in the sense
that they have the highest factor coefficients, i.e., the highest correlations
with their respective clusters. The 17 contents in the introversion cluster
rather compellingly reveal that they are all symptoms of social withdrawal.
These 17 are the abridged set of items that, in stage 4, most clearly reveal
conceptually what the cluster means. At the foot of the list, under Other,
are shown the item numbers and factor coefficients of nine additional
items in the introversion cluster that fill out the full cluster. All are also
introversive symptoms. The nine other items are less distinctive than the
17 above them in the special sense that they show a bit more correlation
with some of the other clusters than the 17 do. The 17 are sharply
differentiated in terms of correlation from the items that compose the
other six clusters. Finally, the reliability coefficients are marked Reliability
at the foot of the introversion listing. The full cluster of 26 items yields a
composite cluster score that has an α reliability coefficient of .925. Even
for the lesser number of 17 abridged items, a composite score has a
reliability with the high value of .911.

The reader may review the other six clusters in detail. Generally, 
they have sharp meanings rather close to the titles assigned them in Table
11.2. Cluster scores on all have high α reliabilities, on the order of .90.
The reliabilities are much higher in general than those found
for the standard nine clinical scales, despite the fact that the scales are 
composed of many more items than are these clusters. 

TABLE 11.2 MOST DISTINCTIVE DEFINING ITEMS OF MMPI CLUSTERS
(FIRST TRIAL) RANKED BY OBLIQUE FACTOR COEFFICIENTS Fc

  No.   Content                          Fc

C1 I, introversion (pivot):
  377   Sit alone at parties             .70
  -57   Poor mixer                       .66
  321   Easily embarrassed               .66
  200   Shy                              .65
  180   Poor conversationalist           .65
 -371   Self-conscious                   .65
  267   Hard to talk in group            .62
  172   Bashful                          .61
   86   Lack self-confidence             .61
  171   Poor at party stunts             .61
 -547   Do not like parties              .60
 -521   Hard to talk in group            .59
   52   Pass friends without speaking    .58
 -309   Don't make friends quickly       .58
 -479   Mind meeting strangers           .56
  509   Can't stick up for self          .53
  292   Don't speak first                .46
Other:  Nos. -79, 317, -264, 138, -353, 304, -449, 15, ...
Reliability: for n = 26, .925; for n = 17(a), .911

(a) Denotes abridged set of best defining items.


FIGURE 11.1
Cluster structure by SPAN of the 566 MMPI items.


A graphical picture of the cluster structure of the MMPI items is
provided in the geometric diagram of Fig. 11.1, which is a map of the
correlations among the items. It is a tracing of the printout of the component
SPAN of BC TRY, produced as the final outcome of a single computer run
of a full cycle key-cluster analysis of the 118 best items for the seven
clusters. It should also be noted that the items selected to define the axes
in the figure are the three pivotal clusters, which are introversion, body,
and suspicion, circled as shaded domains of correlation. Though the three
factored dimensions are perforce orthogonal (the second and third being
derived from residuals), the three clusters of variables that define the
dimensions are themselves not completely independent. They are slightly
oblique to each other. Note also that they encompass the other four
clusters, D, R, A, T, which are therefore called "dependent" clusters. This
means that actual composite scores on these four dependent clusters can
be predicted with fair accuracy from cluster scores on the three pivotal
clusters, I, B, S. Table 11.3 presents in metric form most of the facts
represented pictorially in the geometry of Fig. 11.1. The actual magnitudes of
the intercorrelations between the clusters are given in the correlation
matrix of Sec. A of Table 11.3. In Sec. B of Table 11.3 is shown the index
of overall generality of each of the seven clusters, i.e., the proportion of the
total squares of raw correlations over the whole original correlation matrix
that is reproduced by the sums of squares of correlations by the given
dimension taken singly. It was the fact that the tension cluster had the
highest reproducibility ratio of .88 that partly disposed us to include T as a
fourth cluster in O-analysis, below. The very high α reliabilities computed by
the CSA component and mentioned earlier in Table 11.2 are also given in
Sec. C of the table.
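
The generality index can be read in more than one way, so the following sketch should be taken as one plausible interpretation rather than as the BC TRY computation. It assumes that the correlation reproduced by a single dimension is the product of the two items' factor coefficients on that dimension, and that the index is the ratio of the summed squares of these reproduced values to the summed squares of the raw correlations, both taken over the off-diagonal cells. The function name is hypothetical.

```python
def generality(r, coefficients):
    """Assumed reading: share of the squared raw correlations reproduced by one dimension."""
    n = len(r)
    raw = 0.0
    reproduced = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:  # diagonal cells are excluded
                raw += r[i][j] ** 2
                reproduced += (coefficients[i] * coefficients[j]) ** 2
    return reproduced / raw


# The 118 abridged items give 118 * 117 = 13,806 off-diagonal correlations,
# the count quoted in Sec. B of Table 11.3.
n_correlations = 118 * 117
```

Under this reading a single dimension that reproduced every correlation exactly would have an index of 1, and a dimension unrelated to the items would have an index near 0.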

One significant finding from this cluster structure is that psychiatric 
symptoms as measured by the MMPI compose an oblique structure. That 
is, psychiatric symptoms are generally positively correlated. Furthermore, 
it is found that the most independent pivotal three clusters, introversion, 
body symptoms, suspicion, refer respectively to disorders of self, disorders 
of body, and disorders of mind or thinking. 

It may legitimately be questioned whether the estimation of the 
cluster structure of the full supply from random samples is reliable. For 
evidence on this point, the first three stages of BIGNV were repeated on 
new sets of random drawings. The composite structure on stage 3 of this 
second trial was virtually identical to that of the first trial for the two dimen- 
sions of introversion and body, but the third dimension was a fuzzy mixture 
of suspicion and resentment items. It was known from the first trial that 
S and R were quite oblique and could be statistically combined, so there 
is in fact no inconsistency in the results of the two trials. The other three 
dependent clusters, D, A, and T, showed about the same differentiation 


TABLE 11.3 INTERCORRELATIONS, GENERALITY, AND RELIABILITY
OF THE SEVEN OBLIQUE CLUSTERS

A. Intercorrelations*

          I     B     S     D     R     A     T
   I      -    .47   .27   .73   .45   .41   .68
   B     .36    -    .32    ?     ?    .51   .75
   S     .28   .35    -    .32   .54   .54   .48
   D      ?     ?    .37    -    .64   .57   .78
   R     .57   .15   .66   .74    -    .59   .69
   A      ?     ?     ?    .72    ?     -    .56
   T      ?    .66    ?    .87    ?     ?     -

B. Generality

          I     B     S     D     R     A     T
   Reproducibility of the 13,806 raw correlations
         .65   .55    ?    .83   .72   .67   .88

C. Reliability

          I     B     S     D     R     A     T
   s = 17
         .91   .89   .83   .90   .82   .81   .88
   s > 17
          ?     ?     ?     ?    .89   .86   .92

* Above diagonal, raw correlations; below, interdomain (common factor)
correlations (s = 17). s = number of items per cluster (cluster R has
s = 16). ? = entry not legible.

as in the first trial. Since the results of the second trial confirm those of
the first, we can accept the seven clusters of Table 11.2 as the final cluster 
structure of the MMPI. 


BIGNV in O-analysis


BIGNV applied to the isolating of profile types of persons in O-analysis is 
identical in broad design to BIGNV in V-analysis. The similarity is clearly 
shown in Table 11.4, where the same four stages are given as they apply 
to two separate O-analyses. At the left side of the table is the O-typing 
of the MMPI profiles of 310 psychotics, neurotics, and normal adults; at 
the right is the O-analysis of the achievement profiles of the 145 children 
in the Holzinger problem. 

The MMPI person analysis is considered first. Each person is scored 
on the three pivotal clusters, I, B, and S, and on one of the dependent
clusters. The reason for choosing an additional (dependent) cluster is that 
the three pivotal dimensions are not completely sufficient. The fourth 
cluster T, measuring tension, is selected because it is the largest of the 
dependent clusters, and, as shown in Table 11.3, it is the most general 
of all seven clusters. It also comes nearest conceptually to what is generally 
known as “anxiety,” believed to be a central attribute in many psychiatric 
disorders. 

In stage 1, a standard EUCO analysis, as described earlier, is per- 
formed on random samples of 120 subjects from the full supply of N. In 
stage 2 the full supply of N subjects is divided into random samples (three
in the MMPI study, two in the Holzinger study). The cluster structures of
these samples are determined on the preset dimensions found in stage 1
and tested for sufficiency. Also in stage 2, dummy or marker individuals are
projected into the analyses and thereby the cluster score axes can be set
into the configuration of persons so that the scores of any one of them 
can be read off to a close approximation. 

TABLE 11.4 BIGNV PROCEDURES OF BC TRY APPLIED TO O-ANALYSIS
(MMPI AND HOLZINGER)

MMPI, 310 Adults                           Holzinger, 145 Children

Preliminary

Calculation of cluster scores on           Calculation of cluster scores on
   I, B, S, T                                 F, V, S, M
Calculation of six scattergrams between    Same as MMPI
   the four cluster scores

Stage 1. Sampler: Breakdown of the Supply of Subjects into Samples of 120,
and Determining the Pivotal O-dimensions from Them

Selection of a random sample of 120        Same as MMPI
   subjects
Discovery of four pivotal O-cluster        Same as MMPI
   dimensions on the sample

Stage 2. Preset Cluster Analyses on the Samples: on Stratified Random
Samples of Objects

Selection of three stratified samples of   Selection of two stratified samples;
   persons; no rejections                     no rejections
Preset factoring and structure analysis    Same as MMPI
   of the three samples on the four
   pivots
Projection of dummy model markers and      Same as MMPI
   rational markers into the structure
   (OMARK)
Sufficiency tests of the pivots in the     Same as MMPI
   samples

Stage 3. Merger: Synthesis of the Cluster Structure of the Full Supply of
N Persons from the Samples

Formation of a composite structure and     Same as MMPI
   configuration of the supply from the
   samples

Stage 4. Description: Nature of the Cluster Structure of the Full Supply of
N Persons

(To be reported in a monograph by Stein    Decision on 14 Core O-types
   and Chu)                                O-clusters: allocation of all 145
                                              persons to the Core O-types
                                              (EUFIT)
                                           Level and homogeneity of the 14
                                              O-cluster profile-patterns
                                              (OSTAT)


In stage 3, MERGER procedures synthesize the cluster structure of
the full supply from the samples, as in V-analysis. The results of MERGER
are shown in Figs. 11.2 and 11.3. Plotted in Fig. 11.2 as points on the
surface of a sphere are the persons whose intercorrelations (or interspace
distances) can be accounted for by the first three pivotal person
dimensions. The defining persons of the first three pivotal dimensions are
circled as the three groups labeled C1, C2, and C3 in Fig. 11.2. In order to
see concretely what this figure means, the reader should look ahead at
Fig. 11.4, where he will find the score profiles of these three groups of
persons drawn in the orthodox way on scales with a mean of 50 and standard
deviation of 10. It should be noted that groups C1, C2, and C3 form very
tight profile groups. In short, the SPAN diagram of Fig. 11.2 provides
a total map of the similarities and differences in profiles among all the
persons plotted on the sphere.

FIGURE 11.2
Cluster structure of individuals whose score profiles are described at more
than 90 percent sufficiency on the sphere. Axes: Body, Tension, Suspicion.


Some of the person points have negative signs in front of them,
meaning that their profiles are mirror images, a "reflection," of those
of persons next to them who do not carry negative signs. This can be
seen concretely in the profiles of the reflected group labeled -C1 in Fig.
11.4. Just as those of C1 are uniformly below average on the four cluster
scores, those of -C1 are uniformly above average, though more extreme.

The scaled cluster score axes have been drawn on the surface of
the sphere so that the scores of the persons can be observed. In Fig. 11.2
they are labeled Body, Tension, and Suspicion. Because of the curvature
of the surface, the projections of persons on these axes are somewhat
distorted in the plane of the paper. However, it will be observed that
cluster C1 is below average on B, S, and T and that cluster -C1 is above
average on B, S, and T and farther from the average but in the reflected,
i.e., above average, position.
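
The scaling of the profiles and the idea of a reflected profile can be illustrated briefly. The sketch assumes that "reflection" means mirroring each score about the scale mean of 50, which is how we read the description of C1 and -C1; the function names are hypothetical and are not BC TRY components.

```python
from statistics import mean, pstdev


def to_standard_scale(raw_scores, center=50.0, spread=10.0):
    """Rescale one cluster score over all persons to mean 50 and standard deviation 10."""
    m, s = mean(raw_scores), pstdev(raw_scores)
    return [center + spread * (x - m) / s for x in raw_scores]


def reflect(profile, center=50.0):
    """Mirror a person's profile about the scale mean (assumed meaning of a negative sign)."""
    return [2 * center - score for score in profile]


# A profile uniformly below average becomes one uniformly above average.
print(reflect([40, 42, 38, 41]))  # [60.0, 58.0, 62.0, 59.0]
```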


The significant finding of the O-analysis goes directly to the matter
of validity of the MMPI clusters and of the O-types produced on them. The
normal officers are abbreviated as O's in Fig. 11.2, the psychotics as
P's, and the anxiety cases as A's. The officers generally cluster around C1,
whereas not a single officer lies among the psychiatric cases in the
reflected zone of -C1. Furthermore, some of the officers and none of the
psychiatric cases form cluster C3, high on the suspicion and mistrust
dimension (and aggression) though still low on the body and tension
dimensions. Perhaps it is "normal" for the Armed Services to include some
officers who have such a belligerent, healthy, and relaxed pattern.


FIGURE 11.3
Cluster structure of individuals whose score profiles are described at more
than 90 percent sufficiency on the sphere. Axes: Introversion, Body, Tension.


FIGURE 11.4
Cluster score profiles of the individuals shown in person-clusters C1, C2, C3,
and -C1 of the SPAN spheres.


The next sphere, Fig. 11.3, brings in the introversion, body, and tension
dimensions. Unlike the preceding figure, except for the few persons in 
cluster C,, there are virtually no persons high on the introversion dimension 
and also low on the body and tension dimensions. There are, however, 
quite a fair number high on all three clusters, and these are composed 
almost exclusively of psychiatric cases. 

In the description phase, stage 4, the objective is taxonomic, namely,
to study the configuration of the full supply, such as that shown in Figs.
11.3 and 11.4, and to select a set of contrasting collinear core O-types that
is representative of the complete structure. In choosing the final core
types, the analyst will naturally orient to loci in the configuration where
there are heavy concentrations of individuals. These are empirical or
"natural" clusterings, i.e., types of individuals whose higher frequencies
of special score patterns on the dimensions ensue from selective biosocial
forces. Lacunae in the configuration signify "missing types," biosocially
incompatible, disjunctive patterns of scores on the attributes. This
search for "natural clusters" has precisely the same objective as the
recently emerging field of biology called "numerical taxonomy" (Sokal
and Sneath, 1963).


Purposes of taxonomic analysis. When the selection of core types appears
to be unduly arbitrary, nothing, indeed, should prevent the analyst from
selecting alternative sets of core types for his purpose.

[...] allocate every individual in the full supply to the particular core
type that has a score pattern that fits closest to his own. The procedure
for doing [...] by this allocation process are termed the [...] individuals
fit in O-clusters; [...]

Chapter 12 


STATISTICAL THEORY AND COMPONENT 
PROGRAMS OF BC TRY 


This chapter is a brief description of the statistical and logical basis
of the BC TRY System and of the programs themselves. We do not
attempt to develop here the entire theory and practice of cluster and factor
analysis, our aim being the more modest one of illustrating the major
theoretical and practical aspects of the statistical and logical components
of the System. A number of general texts contain descriptions of the
procedures and mathematics of some of the orthodox factor analysis
components of the System. These subjects are given relatively scant attention
here, deference being paid to other authors on these topics.

The plan of this chapter is to describe each major statistical
component of the System, one at a time. Each component is described in terms
of the general statistical and logical theory behind the component. Where
it is not obvious, the programming of the component is described at the
level of a general algorithm that is machine independent. It is not intended
to exhibit the details of the component programs of the System here. The
actual programs (at the time this was written) make up some 60,000
Fortran statements. The program listings themselves are equivalent to
several volumes the size of this book. As a consequence, it must be obvious
that this chapter and the next are sufficient to give only a general idea of
the operation of the components and the logic behind them. The detailed
descriptions of the data, input and output, component programs, and
details regarding the control of the System by the user constitute the
BC TRY User's Manual and will not be repeated here in any but the most
general form when it is useful in explaining the logic and implementation
of a program. Since the input and output for BC TRY components are
fundamental in discussing the components, we reproduce here, as Tables
12.1 and 12.2, the DST and IST I/O tables. Chapter 13 is an abridged
User's Manual, showing the detail of BC TRY applications in standard
problems.


TABLE 12.1 DST INPUT-OUTPUT TABLE

Programs Which Write DST

   Data processor program, DAP2
   Euclidean distance program, EUCO2

Programs Which Read DST

   Correlation program, COR2
   Correlation/covariance program, COR3
   Cluster and factor scoring program, FACS
   Object typology program, OTYPE (optional)
   Typological prediction program, 4CAST (optional)


Cluster structure analysis 


A cluster of variables is merely a composite or a grouping of variables. In
BC TRY the clusters are selected in an analysis of multivariate data to
satisfy certain criteria. However, in general, any group of variables is
logically a cluster. Clearly, some clusters represent domains of variation
in the population of variables or objects better than other clusters. Cluster
structure analysis is the means whereby the statistical properties of
clusters are studied and statistical relationships of clusters and domains can
be studied. The main correlational properties of clusters, particularly
linear composites of scores of variables in clusters, have long been known
(Spearman, 1913; Guilford, 1950).


The important statistics one wants to know about cluster composites
(cluster scores) are (1) their reliability, (2) the validity with which they
represent some (unspecified) domain, (3) the correlations between
different cluster scores, (4) the correlations between estimates of the domains
the composites represent, (5) the correlations of each of the observed
variables with the cluster composites, (6) the correlations of each of the
observed variables with the domains represented by the cluster
composites, (7) the generality of each oblique dimension, and (8) data on
expansion of clusters with respect to reliability.

All that is required for a cluster structure analysis is a set of n
variables, the correlations (Pearson product-moment) among them, and the
clusters. The nature of the clusters is immaterial, as is the method of
selecting them. The clusters may be selected to satisfy rational-theoretical
considerations or certain statistical considerations as in factoring, etc., or
they can be arbitrarily selected. The analysis is applicable to any collection
of subsets of the n variables, all the variables, or only a portion of them.
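
Because only the correlation matrix and the cluster memberships are needed, several of the statistics listed above can be computed directly from sums over blocks of the matrix. The sketch below uses the standard formulas for sums of standardized variables; it is offered as an illustration of the principle, not as the CSA component of BC TRY, and the function names are hypothetical.

```python
import math


def block_sum(r, rows, cols):
    """Sum of the correlations r[i][j] for i in rows and j in cols."""
    return sum(r[i][j] for i in rows for j in cols)


def alpha_reliability(r, cluster):
    """Coefficient alpha of a composite of standardized variables."""
    s = len(cluster)
    total = block_sum(r, cluster, cluster)  # variance of the composite
    return (s / (s - 1)) * (1 - s / total)  # the s unit variances form the diagonal


def composite_correlation(r, cluster_a, cluster_b):
    """Correlation between two cluster scores (raw, not corrected for unreliability)."""
    cross = block_sum(r, cluster_a, cluster_b)
    return cross / math.sqrt(
        block_sum(r, cluster_a, cluster_a) * block_sum(r, cluster_b, cluster_b)
    )


def variable_composite_correlation(r, variable, cluster):
    """Correlation of one observed variable with a cluster score."""
    return block_sum(r, [variable], cluster) / math.sqrt(block_sum(r, cluster, cluster))
```

A cluster is given simply as a list of row indices into `r`, so overlapping clusters and clusters covering only a portion of the variables need no special handling.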


TABLE 12.2 IST INPUT-OUTPUT TABLE

Files (columns): IDFILE, VNAMS1, VSUMS1, MEANS1, VAREN1, STDEV1, CORRM1,
ONAMS1; (continued) 2MEAN1, 2SEMS1, 2STDEV1, MEANS1, STDEV2, COVAR1,
REORD1, CORRM2, DIAGV1

Programs (rows): DAP2, COR2, COR3, REDE, RLIST, SLEP1, DVP, FALS, NC2,
CC5, SLEP2, GYRO, NCSA2, CSA2, FAST, SPAN2, FACS, EUCO2, COMP1, COMP2,
RSCAT, OTYPE, OSTAT, 4CAST

Notes: I = input for all control options; (I) = input for some control
options; O = output for all control options; (O) = output for some control
options. All files can be input and/or output by GIST and GIVE.

Also, one may apply the methods of cluster structure analysis to
"reproduced" correlations from a factor structure even though the
correlation matrix is not available.

The basic model for cluster structure analysis is the domain sampling
model that is utilized throughout BC TRY. The sample of variables in a
study is identified as

$$X_1, X_2, \ldots, X_{NV}$$

A subset of these observed variables is selected as a cluster, say the $i$th
cluster out of $K$ clusters, and designated as

$$V_{i1}, V_{i2}, \ldots, V_{is_i}$$


TABLE 12.2 IST INPUT-OUTPUT TABLE (Continued)


Program UFACT1 CLUST1 | REFLX1  RFACTI BASIS1 | FSCOR1 кени 


| | 
| | 
ФАР? | | | 
совг 
COR3 
REDE 


RLIST | | | 
SLEP1 | | | | | 
DVP | | 
FALS | [9] | | | | 


NC2 о | (0) о | 
CC5 ON 0 (0 | 
SLEP2 | 10 | (XO | (Xo) Er Ut 
GYRO І 


(ІІ 9 0 
NCSA2 a} за co ЖҮ 
С5А2 (1) 
ҒА5Т | | 
SPAN? (1) a | 0 


ЕСУ СС (ho 
FACS @) (1) (1) | ( 
EUCO? Ше e 
COMPI (1) | (9 | 
COMP2 


RSCAT | n 
OTYPE | 

OSTAT | 

4CAST | 


The subscript $s_i$ indicates the number of variables in the $i$th cluster.
The several clusters are thus subsets of variables out of the total set of
observed variables. The clusters may have common variables, but no special
procedures are involved when a variable is in two or more clusters and we
need not make special note of the fact. A complete list of the $K$ clusters
might appear as follows:


Cluster 1: $V_{11}, V_{12}, \ldots, V_{1s_1}$
Cluster 2: $V_{21}, V_{22}, \ldots, V_{2s_2}$
Cluster $i$: $V_{i1}, V_{i2}, \ldots, V_{is_i}$

Cluster $K$: $V_{K1}, V_{K2}, \ldots, V_{Ks_K}$


A cluster composite is simply the sum of the values of the cluster
variables

$$C_i = V_{i1} + V_{i2} + \cdots + V_{is_i} \tag{12.2}$$


It is assumed that the variables are in standard form, i.e., with mean 
of 0 and standard deviation of 1. This assumption is not necessary, but it 
simplifies the notation and equations. Since the desired results are in 
terms of correlations, the standardization of the variables is implied from 
the first even though the equations could be stated in terms of the obtained 
means and standard deviations. 

Corresponding to cluster $C_i$ there is a cluster domain and a set of
variable domains from which the observed variables were sampled. A
variable domain is defined as a very large set of variables that are
theoretically observable, but unobserved, and precisely "parallel," or
collinear, with the corresponding observed variables. Thus, for variable
$X_j$ the variable domain is defined by the set

$$X_{j1}, X_{j2}, \ldots, X_{jg}, \ldots \tag{12.3}$$

The domain composite for the variable domain is defined as

$$D_{X_j} = X_j + \sum_{g=1} X_{jg} \tag{12.4}$$

The domain composite for a cluster domain is defined as the sum of the
variable domain composites corresponding to the variables defining the
cluster. Thus, for cluster $i$

$$D_i = C_i + \sum_{j=1}^{s_i} \sum_{g=1} V_{ijg} \tag{12.5}$$

The definitions of cluster composite, domain, domain composite, etc.,
just outlined, permit some rather far-reaching statements about the
correlation characteristics of clusters and domains. These characteristics
can be made quantitative from knowledge only of the clusters, the
intercorrelations of the observed variables, and the "communality" of the
variables. For the time being assume that the diagonal values of the
correlation matrix are filled with the communalities of the variables

$$h^2_{X_1}, h^2_{X_2}, \ldots, h^2_{X_{NV}}$$

or, in an alternative notation,

$$r_{X_1X_1}, r_{X_2X_2}, \ldots, r_{X_{NV}X_{NV}}$$

When reference is made to the communality of some variable in a cluster,
an alternative notation is

$$h^2_{V_{ij}}$$

for the $i$th cluster and the $j$th variable in the cluster. The domain
sampling interpretation of these values and the means whereby they are
obtained are presented later in this chapter.

Domain validity 


The domain validity of a cluster composite can be interpreted as the
accuracy of the cluster score in estimating the cluster domain composite.
The domain validity of the cluster composite $C_i$ in estimating the cluster
domain composite $D_i$ is the correlation of $D_i$ and $C_i$

$$r_{D_iC_i} = \sqrt{\frac{\sum_{j=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ij}V_{ik}}}{\sum_{j=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ij}V_{ik}} + \sum_{j=1}^{s_i}\left(1 - h^2_{V_{ij}}\right)}} \tag{12.6}$$


Reliability of the cluster composite 


The Spearman-Brown reliability of a cluster composite is given by the
square of the domain validity

$$r_{C_iC_i'} = r^2_{D_iC_i} \tag{12.7}$$

where $C_i'$ is a parallel domain (see Tryon, 1957a, and Ghiselli, 1964).
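As a rough illustration of Eqs. (12.6) and (12.7), the following sketch computes the reliability and domain validity of a cluster composite from a correlation matrix and a set of communality estimates. The function names and the list-of-lists input format are assumptions made for this example; they are not part of BC TRY.

```python
import math

def cluster_reliability(R, h2, cluster):
    """Eq. (12.7): reliability of the composite of the variables in `cluster`.

    R       -- full correlation matrix as a list of lists (unit diagonal)
    h2      -- communality estimate for every variable
    cluster -- indices of the defining variables (hypothetical input format)
    """
    off_diagonal = sum(R[j][k] for j in cluster for k in cluster if j != k)
    common = off_diagonal + sum(h2[j] for j in cluster)  # communalities in the diagonal
    total = off_diagonal + len(cluster)                  # unities in the diagonal
    return common / total

def domain_validity(R, h2, cluster):
    """Eq. (12.6): correlation of the cluster composite with its domain composite."""
    return math.sqrt(cluster_reliability(R, h2, cluster))

# Two variables generated by a single common factor with loadings .8 and .6.
R = [[1.0, 0.48], [0.48, 1.0]]
h2 = [0.64, 0.36]
print(cluster_reliability(R, h2, [0, 1]))
print(domain_validity(R, h2, [0, 1]))
```

The denominator of (12.6) is the same double sum with unities in the diagonal, since adding $1 - h^2$ for each definer restores the unit diagonal; the sketch uses that form directly.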


Cluster composite intercorrelation 


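The intercorrelations among cluster composites are given by

$$r_{C_iC_j} = \frac{\sum_{g=1}^{s_i}\sum_{k=1}^{s_j} r_{V_{ig}V_{jk}}}{\sqrt{\sum_{g=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ig}V_{ik}} \; \sum_{g=1}^{s_j}\sum_{k=1}^{s_j} r_{V_{jg}V_{jk}}}} \tag{12.8}$$

The diagonal elements $r_{V_{ig}V_{ik}}$ and $r_{V_{jg}V_{jk}}$, where $g = k$,
are assumed to be unities in this equation rather than communalities. In
orthodox factor analysis this coefficient is known as the "correlation
between oblique factor estimates."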

Cluster domain intercorrelations


The intercorrelations among cluster domains are the intercorrelations
among cluster composites, Eq. (12.8), except that the diagonal elements
are the communalities of the respective observed variables. Equation
(12.9) is thus the same as (12.8), with unities in the diagonal for (12.8)
and communalities for (12.9). The interdomain correlations are denoted

$$r_{D_iD_j} \tag{12.9}$$

A rondo of relationships exists among the correlations already indicated.
This is given in

$$r_{C_iC_j} = r_{D_iD_j}\, r_{D_iC_i}\, r_{D_jC_j} \tag{12.10}$$

The intercorrelation of cluster composites is equal to the cluster domain
intercorrelation reduced proportionally by the invalidity of the cluster
composites as measures of the respective domains.
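The relationship in (12.10) can be checked numerically. The sketch below computes the composite intercorrelation of (12.8) and the domain intercorrelation of (12.9) with one routine that differs only in what is placed in the diagonal, then compares the two sides of (12.10). The helper names and the small correlation matrix are invented for illustration.

```python
import math

def block_sum(R, rows, cols, diag=None):
    """Sum R[j][k] over rows x cols; where j == k use diag[j] if a diagonal is supplied."""
    total = 0.0
    for j in rows:
        for k in cols:
            if j == k and diag is not None:
                total += diag[j]
            else:
                total += R[j][k]
    return total

def cluster_corr(R, ci, cj, diag=None):
    """Eq. (12.8) when diag is None (unities); Eq. (12.9) when diag holds communalities."""
    numerator = block_sum(R, ci, cj, diag)
    return numerator / math.sqrt(block_sum(R, ci, ci, diag) * block_sum(R, cj, cj, diag))

def validity(R, h2, cluster):
    """Eq. (12.6), written as the ratio of the two forms of the within-cluster sum."""
    return math.sqrt(block_sum(R, cluster, cluster, h2) / block_sum(R, cluster, cluster))

# Illustrative data only: four variables, two non-overlapping clusters.
R = [[1.00, 0.60, 0.30, 0.20],
     [0.60, 1.00, 0.25, 0.30],
     [0.30, 0.25, 1.00, 0.50],
     [0.20, 0.30, 0.50, 1.00]]
h2 = [0.50, 0.60, 0.40, 0.45]
ci, cj = [0, 1], [2, 3]

r_cc = cluster_corr(R, ci, cj)        # composites, Eq. (12.8)
r_dd = cluster_corr(R, ci, cj, h2)    # domains, Eq. (12.9)
print(r_cc, r_dd * validity(R, h2, ci) * validity(R, h2, cj))
```

For clusters that share no variables, the two printed values agree up to rounding, which is what (12.10) asserts.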


Correlation of cluster composites 
and cluster domains 


These correlations are simply the interdomain correlations reduced
proportionally by the invalidity of the cluster composite

$$r_{C_iD_j} = r_{D_iD_j}\, r_{D_iC_i} \tag{12.11}$$

Correlation of variables with cluster domains


The correlation of an observed variable with a cluster domain is known as
the "oblique factor coefficient" of that variable with regard to the domain.
All diagonal elements are communalities

$$r_{X_jD_i} = \frac{\sum_{k=1}^{s_i} r_{X_jV_{ik}}}{\sqrt{\sum_{g=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ig}V_{ik}}}} \tag{12.12}$$
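A minimal sketch of the oblique factor coefficient (12.12) follows, assuming a correlation matrix held as a list of lists and a separate list of communality estimates. The function name is hypothetical. The example data come from a single common factor, so the coefficient of each variable on the cluster domain should simply return that variable's loading.

```python
import math

def oblique_factor_coefficient(R, h2, x, cluster):
    """Eq. (12.12): correlation of variable x with the domain of `cluster`.

    Communalities replace the diagonal wherever a variable meets itself,
    including the case where x is one of the cluster definers.
    """
    def r(j, k):
        return h2[j] if j == k else R[j][k]

    numerator = sum(r(x, k) for k in cluster)
    common = sum(r(j, k) for j in cluster for k in cluster)
    return numerator / math.sqrt(common)

# One common factor with loadings .8, .6, .5; the cluster is defined by the first two.
a = [0.8, 0.6, 0.5]
R = [[1.0 if j == k else a[j] * a[k] for k in range(3)] for j in range(3)]
h2 = [v * v for v in a]

for x in range(3):
    print(x, oblique_factor_coefficient(R, h2, x, [0, 1]))
```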


Augmented oblique factor coefficient 


This coefficient is the correlation between the cluster domain and the
domain represented by a single variable, $D_{X_j}$,

$$r_{D_{X_j}D_i} = \frac{r_{X_jD_i}}{\sqrt{h^2_{X_j}}} \tag{12.13}$$

The augmented factor coefficient is the correlation between the variable
and the domain, adjusted in accordance with the communality of the
variable.


Generality of each oblique dimension


One of the major points made in discussing factoring is the degree to
which the orthogonal factors account for the intercorrelations among the
NV variables or for the communality of the NV variables. In cluster
structure analysis, since the clusters are not orthogonal, in general, these
considerations must be made separately for each of the cluster dimensions.
The generality across orthogonal factors is cumulative, but the generality
across nonorthogonal cluster dimensions is not.

The generality of an oblique cluster can be represented by either of
two indexes: (1) the degree (proportion) to which the cluster dimension
accounts for the intercorrelations among the NV variables or (2) the degree
to which it accounts for their common variances, their communalities.
"Account for" means the degree to which the intercorrelations or
communalities can be reproduced by the given cluster dimension. The
formulas are given below.

The cluster domain $D_i$ can be represented as a variable or dimension.
If the domain $D_i$ represents all that is general to the relationship between
two observed variables, say $X_j$ and $X_k$, the partial correlation

$$r_{X_jX_k \cdot D_i} = \frac{r_{X_jX_k} - r_{X_jD_i}\, r_{X_kD_i}}{\sqrt{\left(1 - r^2_{X_jD_i}\right)\left(1 - r^2_{X_kD_i}\right)}} \tag{12.14}$$

will be exactly zero. When the partial correlation is zero, the correlation
between the two variables is equal to the product of the correlations of the
variables and the domain. That is,

$$r_{X_jX_k} = r_{X_jD_i}\, r_{X_kD_i} \quad \text{when } r_{X_jX_k \cdot D_i} = 0 \tag{12.15}$$

In general, the right-hand term of (12.15) will not be equal to the left-hand
term. The right-hand term, however, is a measure of the degree to which
the domain correlation of the variables with $D_i$ accounts for the
correlation of $X_j$ and $X_k$. The quantity is known as the "reproduced
correlation"

$$r^*_{X_jX_k} = r_{X_jD_i}\, r_{X_kD_i} \tag{12.16}$$

The more nearly $r^*_{X_jX_k}$ equals $r_{X_jX_k}$, the more nearly does $D_i$
account for the correlation of the two variables. In taking a general index
of the degree to which $D_i$ accounts for the correlation among the NV
variables the respective coefficients are first squared and then the squares
are summed over the entire set of variables and a ratio of the sums of
squares formed

$$\Gamma^2 = \frac{\sum_{j=1}^{NV}\sum_{k=1}^{NV} \left(r^*_{X_jX_k}\right)^2}{\sum_{j=1}^{NV}\sum_{k=1}^{NV} r^2_{X_jX_k}} = \frac{\sum_{j=1}^{NV}\sum_{k=1}^{NV} \left(r_{X_jD_i}\, r_{X_kD_i}\right)^2}{\sum_{j=1}^{NV}\sum_{k=1}^{NV} r^2_{X_jX_k}} \tag{12.17}$$

To the degree that $\Gamma^2$ approaches unity, the domain $D_i$ has
generality across the entire set of observed variables, with respect to their
intercorrelations. The diagonal values in the correlation matrix in (12.17)
are the communalities obtained from a BC TRY factoring component, such
as CC5.

The communality $h^2_{X_j}$ of the variable $X_j$ is the portion of the
variance of $X_j$ predictable from the other $NV - 1$ variables in the study.
In the degree to which the squared domain correlation $r^2_{X_jD_i}$ of a
variable on a domain is equal to the variable's communality, that domain
accounts for the generality of the variable in the set of NV variables. In
applying this to all NV variables for a given domain, the total generality of
the NV variables is the sum of the communalities, and the degree to which
this is accounted for by a domain is the sum of the squared domain
correlations

$$\Gamma^2 = \frac{\sum_{j=1}^{NV} r^2_{X_jD_i}}{\sum_{j=1}^{NV} h^2_{X_j}} \tag{12.18}$$

The degree that $\Gamma^2$ approaches 1.00 is an indicator of the generality
of the domain $D_i$ over the set of NV variables with respect to their
communality. This ratio is adversely affected as an adequate measure of
generality to the extent that the estimates of communality used are
inadequate.
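The two generality indexes, (12.17) and (12.18), can be sketched together. The code below takes the oblique factor coefficients of all variables on one domain as given; the function name and argument layout are assumptions for this illustration. With data generated by a single common factor, the domain reproduces every correlation and every communality, so both indexes should come out at 1.00.

```python
def generality_indexes(R, h2, coefficients):
    """Return (index of Eq. 12.17, index of Eq. 12.18) for one cluster domain.

    coefficients[j] is the oblique factor coefficient of variable j on the domain.
    Communalities are used as the diagonal of the correlation matrix in (12.17).
    """
    n = len(R)
    reproduced_ss = 0.0
    observed_ss = 0.0
    for j in range(n):
        for k in range(n):
            observed = h2[j] if j == k else R[j][k]
            reproduced = coefficients[j] * coefficients[k]   # Eq. (12.16)
            reproduced_ss += reproduced ** 2
            observed_ss += observed ** 2
    correlation_index = reproduced_ss / observed_ss
    communality_index = sum(f * f for f in coefficients) / sum(h2)
    return correlation_index, communality_index

# One common factor: the coefficients equal the loadings.
a = [0.8, 0.6, 0.5]
R = [[1.0 if j == k else a[j] * a[k] for k in range(3)] for j in range(3)]
h2 = [v * v for v in a]
print(generality_indexes(R, h2, a))
```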


Expanding a cluster by adding variables 
that maximize the reliability coefficient 
of the new cluster composite 


If one adds to the definers of a cluster additional variables with high factor
coefficients on the cluster dimension, the reliability of the expanded
composite may be increased. When such an increase can be made, however,
there is a maximal value of the reliability coefficients of the expanded
composite beyond which further additions of variables can give reliabilities
less than the maximal value. The problem is to discover what additional
variables will contribute to the maximizing of reliability and what the
optimal value will be after including them in the composite.


The reliability of a cluster composite is given by Eq. (12.7). If the
cluster is expanded by a single new variable, say $V_x$, the equation can be
stated ($C_i^*$ is the expanded $i$th cluster and $C_i^{*\prime}$ is its parallel)

$$r_{C_i^*C_i^{*\prime}} = \frac{\sum_{j=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ij}V_{ik}} + \left[h^2_{V_x} + 2\sum_{j=1}^{s_i} r_{V_xV_{ij}}\right]}{\sum_{j=1}^{s_i}\sum_{k=1}^{s_i} r_{V_{ij}V_{ik}} + \sum_{j=1}^{s_i}\left(1 - h^2_{V_{ij}}\right) + \left[1.0 + 2\sum_{j=1}^{s_i} r_{V_xV_{ij}}\right]} \tag{12.19}$$

The reliability of the extended cluster $C_i^*$ depends on the relationship of
the bracketed parts in the numerator and denominator. If the ratio

$$W_x = \frac{h^2_{V_x} + 2\sum_{j=1}^{s_i} r_{V_xV_{ij}}}{1.0 + 2\sum_{j=1}^{s_i} r_{V_xV_{ij}}} \tag{12.20}$$

has a value exceeding the reliability of the unexpanded cluster, the
reliability of the expanded cluster will be greater than the reliability of the
unexpanded cluster. Whenever the ratio (12.20) is less than the reliability
of the unexpanded cluster, expansion of the cluster will not increase the
reliability. Thus $W_x$ is the critical value of the variable $V_x$ with respect to
the expansion of the cluster by adding $V_x$.
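The decision rule built on (12.19) and (12.20) can be illustrated with a short sketch: compute the reliability of the unexpanded cluster, compute the critical ratio for a candidate variable, and confirm that the expanded reliability moves in the direction the ratio predicts. The function names are hypothetical, and the data are generated from a single common factor purely for illustration.

```python
def cluster_reliability(R, h2, cluster):
    """Eq. (12.7): common variance of the composite over its total variance."""
    off_diagonal = sum(R[j][k] for j in cluster for k in cluster if j != k)
    return (off_diagonal + sum(h2[j] for j in cluster)) / (off_diagonal + len(cluster))

def critical_ratio(R, h2, cluster, x):
    """Eq. (12.20): ratio of the bracketed terms of (12.19) for candidate variable x."""
    cross = 2.0 * sum(R[x][j] for j in cluster)
    return (h2[x] + cross) / (1.0 + cross)

# Loadings .8, .6, .5 on one common factor; variable 2 is the potential expander.
a = [0.8, 0.6, 0.5]
R = [[1.0 if j == k else a[j] * a[k] for k in range(3)] for j in range(3)]
h2 = [v * v for v in a]

before = cluster_reliability(R, h2, [0, 1])
ratio = critical_ratio(R, h2, [0, 1], 2)
after = cluster_reliability(R, h2, [0, 1, 2])
print(before, ratio, after)
```

Because the expanded reliability is formed by adding the numerator and denominator of the ratio to those of the old reliability, it always falls between the old reliability and the ratio, provided both denominators are positive.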

CSA uses this development in "expanding" clusters. Each variable
which is not in the cluster but which has its highest oblique factor
coefficient on the cluster is called a "potential expander." The potential
expanders are "added" to the cluster one at a time in the order of the
magnitude of (12.19) as evaluated for the single variable expansion.
Equation (12.19) is evaluated next for the progressively expanded cluster.
These statistics, and that given as a lower bound (12.22), are used by the
analyst to determine a reclustering if needed. The reclustering is not
done by CSA.

Equations (12.19) and (12.20) are written in terms of the correlations
among the observed variables. For computational purposes the best
equation is based on the factor coefficients

$$r_{C_i^*C_i^{*\prime}} = \frac{h^2_{V_x} + 2 r_{V_xD_i}\left(\sum_{j=1}^{s_i} r_{V_{ij}D_i}\right) + \left(\sum_{j=1}^{s_i} r_{V_{ij}D_i}\right)^2}{s_i - \sum_{j=1}^{s_i} h^2_{V_{ij}} + 1.0 + 2 r_{V_xD_i}\left(\sum_{j=1}^{s_i} r_{V_{ij}D_i}\right) + \left(\sum_{j=1}^{s_i} r_{V_{ij}D_i}\right)^2} \tag{12.21}$$

Theoretically, the critical value of a variable as a potential expander of a
cluster is related to the factor coefficient of the variable with respect to
that cluster dimension. Re-expressing (12.20) in terms of the factor
coefficient, setting (12.20) equal to the reliability coefficient of the cluster,
and solving for the root of the quadratic equation implied by the expression
gives a lower bound for the factor coefficient of a variable that can be
added to the cluster and increase the reliability of the cluster. This lower
bound of the factor coefficient (LBFC) of a variable $V_x$ can be calculated by

$$\mathrm{LBFC} = -\left(1.0 - r_{C_iC_i'}\right)\sum_{j=1}^{s_i} r_{V_{ij}D_i} + \sqrt{\left[\left(1.0 - r_{C_iC_i'}\right)\sum_{j=1}^{s_i} r_{V_{ij}D_i}\right]^2 + r_{C_iC_i'}} \tag{12.22}$$

If the variable $V_x$ has a factor coefficient $r_{V_xD_i}$ equal to or greater than
the LBFC, it should enhance the reliability of $C_i$ when $C_i$ is expanded to
include it. This lower bound is only theoretical, and the empirical method of
trying the potential expanders is used in the program, although the LBFC is
output for each cluster.
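The derivation of the lower bound can be followed in a small sketch. It rests on one assumption that should be stated plainly: the communality of the candidate variable is taken to equal the square of its factor coefficient on the cluster, which is what turns the ratio (12.20) into a quadratic in that coefficient. The function names are invented for this example and this is not the CSA routine itself.

```python
import math

def lower_bound_factor_coefficient(reliability, definer_coefficients):
    """Positive root of f**2 + 2*(1 - r)*T*f - r = 0, with T the sum of the
    definers' factor coefficients and r the reliability of the present cluster."""
    slack = (1.0 - reliability) * sum(definer_coefficients)
    return -slack + math.sqrt(slack * slack + reliability)

def critical_ratio_from_coefficient(f, definer_coefficients):
    """Ratio (12.20) in factor-coefficient form, ASSUMING h**2 = f**2 for the candidate."""
    cross = 2.0 * f * sum(definer_coefficients)
    return (f * f + cross) / (1.0 + cross)

# Cluster defined by two variables with factor coefficients .8 and .6,
# whose composite reliability is 1.96 / 2.96.
coefficients = [0.8, 0.6]
reliability = 1.96 / 2.96
bound = lower_bound_factor_coefficient(reliability, coefficients)
print(bound)
print(critical_ratio_from_coefficient(bound, coefficients), reliability)
```

At the bound the critical ratio equals the present reliability; a candidate with a larger coefficient, such as .5 here, gives a ratio above it.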

When an added variable carries a negative sign on its factor
coefficient with the defining cluster, it must be reflected. This means that
its factor coefficient becomes positive and that all its correlations with
definers and other added variables must have signs changed. Reflection of
variables is discussed in more detail in the section below on cluster
analysis procedures.


Miscellaneous statistics and analyses
within cluster structure analysis


A number of statistical and logical analyses of secondary interest are 
performed by the BC TRY component. They are listed here with brief 
explanations. 

The augmented correlation between observed variables is the
correlation coefficient divided by the product of the square roots of the
communalities of the respective variables

$$r_{D_{X_j}D_{X_k}} = \frac{r_{X_jX_k}}{h_{X_j}\, h_{X_k}} \tag{12.23}$$

These coefficients represent the correlations of the variable domains for
all NV observed variables.
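Equation (12.23) amounts to a correction of every correlation for the uniqueness of the two variables involved. A minimal sketch, with a hypothetical function name and a two-variable example chosen so the arithmetic is easy to follow:

```python
import math

def augmented_correlations(R, h2):
    """Eq. (12.23): divide each correlation by the square roots of both communalities.

    The diagonal is set to 1.0, since a communality divided by itself is unity.
    """
    n = len(R)
    return [[1.0 if j == k else R[j][k] / math.sqrt(h2[j] * h2[k])
             for k in range(n)]
            for j in range(n)]

# A correlation of .30 between variables with communalities .50 and .72.
R = [[1.0, 0.3], [0.3, 1.0]]
h2 = [0.50, 0.72]
augmented = augmented_correlations(R, h2)
print(augmented[0][1])
```

Poor communality estimates can push an augmented value above 1.00, so in practice the result is only as trustworthy as the diagonal values supplied.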

An objective, nontheoretical description of the cluster structure in a
sample of observations can be achieved by reorganizing the raw correlation
matrix. The rows and columns of the matrix are simply arranged so
that variables in a cluster appear together in the matrix, placing
nonclustered variables next to the cluster with which they correlate most
highly ($r_{X_jD_i}$). Variables are reflected to make intracluster submatrices
positive, and variables with communalities below .20 are given positions in
the rightmost columns and bottommost rows of the matrix.

The mean correlation of each variable with the definers of each
cluster is also calculated by CSA. These values can be calculated without
resort to the raw correlation matrix by the use of

$$\bar{r}_{X_jV_i} = \frac{r_{X_jD_i} \sum_{k=1}^{s_i} r_{V_{ik}D_i}}{s_i} \tag{12.24}$$
The BC TRY component CSA


The purpose of CSA is to apply the equations given above to clusters of
variables. The clusters may come from a form of cluster analysis, or they
may be input through the GIST option of the system. The limits of the
component are similar to the limits on the cluster analysis components
of the system: 15 clusters, each consisting of up to 20 variables.

In addition to the correlation information about the cluster
composites, cluster domains, variables, and variable domains, other
information is also generated by the component. A complete list of the IST
input and output of the component is given in Table 12.2. The output to the
printer is as follows:


1 Defining variables of the clusters
2 Correlations between cluster composites and between cluster domains
3 Domain validities of cluster composites
4 Correlations of variables with cluster domains
5 Generality of each cluster: the factor contribution to the communalities of
the variables and to the correlation among variables
6 Expanded cluster characteristics: reliability of each cluster, reliability when
each additional variable is included in the cluster, reliability when the
additional variables are included incrementally
7 The clustered correlation matrix (optional)
8 The augmented correlation matrix or correlations corrected for uniqueness
(optional)

Noncommunality cluster structure analysis 


The domain sampling model for CSA depends on the exact collinearity of
the variables in the domain. This assumption leads to the requirement
that the communality of each variable be available, or at least estimated.
A second domain sampling model avoids the necessity of having
communalities. The second model is called the "noncommunality" or "NC
model." In the NC model, a cluster composite is defined in the same way
that it was defined in CSA. However, the cluster domain composites are
defined in slightly different ways. The domain is composed of the observed
variables in the cluster plus a very large number of other variables having
the same pattern of correlations that exists in the sample of variables in
the observed sample

$$C_i = \sum_{j=1}^{s_i} V_{ij} \tag{12.25}$$

$$D_i = C_i + \sum_{k} V_{ik} \quad \text{for } k = s_i + 1, s_i + 2, \ldots \tag{12.26}$$

The variables in the last term of (12.26), that is, $V_{ik}$ for
$k = s_i + 1, s_i + 2, \ldots$, are not in the observed set of variables but in the
theoretical domain of the cluster. This definition is not unlike that of the
domain in CSA except that in CSA the variables in the domain were collinear
replicas of the observed variables. In an NC cluster domain the variables are
not necessarily collinear but satisfy a pattern condition. The pattern of
intercorrelations in the domain must be the same as the pattern in the
sample to the extent that the mean intercorrelation in the domain must be
equal in the limit to the mean intercorrelation in the sample cluster.

The cluster domain correlation properties of a cluster are essentially
the same in NCSA and CSA. The equations in NCSA generally are the same
as those in CSA except that the mean observed intercorrelation takes the
place of communalities or the diagonal term is dropped in many of the
expressions. These specific results will not be repeated here since they
are similar to the CSA results and are also presented in Tryon (1957
and 1958a).

In practice, the values of the correlation properties of clusters com- 
puted in NCSA are usually very close to those of CSA. 

It is possible to define clusters with mean intercorrelations that are
negative, particularly when the clusters are determined by some
psychological theory. When this happens, the cluster structure analysis
becomes indeterminate and the BC TRY component NCSA suffers a program
halt. Consequently, except in special need, CSA is recommended over NCSA
in cluster structure analysis.


Diagonal values 


The first major decision to be made in performing a cluster or factor
analysis is what portion of the observed variance among individuals in each
of the NV attributes should be described by scores on the different clusters
or factors. This portion is called the "diagonal element" to be inserted
in the correlation matrix when those forms of dimensional analysis are
performed that require the presence of such an element (CC5 and FALS).

The usual decision is any one of three: (1) all the variance, hence the
diagonal elements are 1.00; (2) the portion that is reliable, hence reliability
coefficients are used; or (3) the portion that is general, hence
communalities are used, denoting the amount of variance of each variable
that is common to it and to all the other $NV - 1$ variables that sample
different attributes.

Since the general objective of cluster and factor analysis is to replace
the NV variables by $K$ composites that describe all that is general among
individual differences across all NV attributes, communalities are usually
chosen as the diagonal values (though in some special studies one may
wish to use 1.00 as diagonal values). We focus here on the methods of
estimating the communalities of the NV variables by various subprograms
of DVP.

On domain sampling principles, communality has a definition similar
to that of the reliability coefficient of a variable. The reliability coefficient
of a sample variable $X_j$ is the square of the correlation between $X_j$ and a
hypothetical domain score that is a composite of many scores that sample
different attributes, sample variables whose correlations with the remaining
$NV - 1$ variables are collinear, i.e., proportional with those of $X_j$.
Specifically

$$h^2_{X_j} = r^2_{X_jD_{X_j}} \tag{12.27}$$

where $D_{X_j}$ is defined as the domain dimension of the variable with a form
similar to that of (12.3) and (12.4). This model states that $D_{X_j}$ is a
collinear domain composed of scores on different attributes whose patterns
of correlations with the other $NV - 1$ variables of the study are identical.
A second basis for (12.27) is the model that defines the communality as the
correlation between $X_j$ and another collinear variable that has exactly the
same magnitudes of correlation with the other $NV - 1$ variables of the
study.


The use of communalities 
Communalities are used in a number of ways in cluster and factor analysis.
In the BC TRY System they have two important uses: (1) as an index of
generality and (2) as a criterion to determine the number of cluster or
factor dimensions. The most important use of communality of a variable
is as an index of individual differences in the variable across the other
$NV - 1$ variables of the study. This is a purely objective use of the
definition of communality given in (12.27). Communalities are used in a
number of ways in the process of factoring a set of variables. The first use
in factoring is to estimate the communalities before factoring and then to
factor until the estimates are "essentially" reproduced by the factors. In
most analyses it is sufficient to factor up to the point where, say,
98 percent of the grand total of the communalities is reproduced. Initial
estimates of communalities here form a terminating criterion of factoring.
The second use of communalities is to insert them in diagonal cells of the
correlation matrix during factoring. In this use, communalities perform a
double role, their necessary role as diagonal elements and their optional
role in forming a saliency criterion for terminating factoring.
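The terminating criterion described above can be sketched as a simple accumulation: add the squared loadings of each successive independent factor and stop once the running total reaches the chosen proportion of the summed initial communality estimates. The function name, the argument layout, and the numbers are assumptions for illustration; the 98 percent default is taken from the example figure in the text.

```python
def factors_needed(loadings, h2_initial, proportion=0.98):
    """Return how many successive factors reproduce `proportion` of the summed
    initial communality estimates, or None if the factors supplied never do.

    loadings[f][j] is the loading of variable j on the f-th independent factor.
    """
    target = proportion * sum(h2_initial)
    reproduced = 0.0
    for count, factor in enumerate(loadings, start=1):
        reproduced += sum(a * a for a in factor)
        if reproduced >= target:
            return count
    return None

# Invented loadings for three variables on three successive factors.
loadings = [[0.80, 0.70, 0.60],
            [0.30, 0.20, 0.10],
            [0.05, 0.05, 0.05]]
h2_initial = [0.74, 0.54, 0.38]
print(factors_needed(loadings, h2_initial))
```

Here the first factor reproduces 1.49 of the total 1.66 and the second brings the sum to 1.63, which passes the 98 percent target, so two factors suffice.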
Solving for values of the communalities


In a given problem involving NV variables there is, in principle, a unique
value defined by (12.27). In practice, it may be difficult to solve for the
exact value, but a close approximation can always be found. Each type of
solution implemented in the BC TRY component DVP is a special
operational design formulated to secure approximations to (12.27). There
are three such designs, namely, the communality $h^2_{X_j}$ estimated from


1 The correlations of $X_j$ with the other $NV - 1$ variables
2 A subset of variables most collinear with $X_j$
3 A salient dimensional analysis performed by factoring


Some of the estimates are direct solutions in which the communalities are
unknown in equations and are solved for directly. Others are indirect or
iterated solutions in which "trial" values of the communalities are set in
equations that permit one to solve for the communalities; then iterated
solutions are performed until all communalities converge at some specified
degree of precision or until a specified number of iterations are made.
Both direct and reiterated solutions are available under the three main
computational designs.


The quadratic formula, QF 


The variable $X_j$ is predicted by the regression of $X_j$ on the remaining
$NV - 1$ variable domains in the QF procedure. It can be shown that $D_{X_j}$
is contained in, i.e., linearly dependent on, the configuration of the $NV - 1$
domains representing the other $NV - 1$ variables and that the locus of
$D_{X_j}$ in the domain space can be determined precisely from the locus of
the other $NV - 1$ points and their geometric relationships. Thus, we can
replace $D_{X_j}$ with a suitable regression composite of the other variables.
The squared multiple correlation is then an estimate of the communality
of $X_j$. Let

$$h^2_{X_j} = R^2_{X_j} \tag{12.28}$$

Calculating $R^2_{X_j}$ for $j = 1, \ldots, NV$ using $NV - 1$ predictors for each
variable is impractical where NV is large. Hence the DVP routine QF uses
a most collinear subset of the $NV - 1$ variables (the number of variables
in the subset is selected by the user, with a maximum of 10).

In order to calculate (12.28), initial values must be inserted into the
diagonals of the subset of variables used to define $D_{X_j}$. The highest
correlation of $X_j$ with the remaining $NV - 1$ variables is used as this
initial value ($j = 1, \ldots, NV$). Using these initial diagonal values, the
submatrix for calculating (12.28) for $X_j$ is augmented by dividing each
correlation by the square root of the product of the corresponding initial
diagonal values. This augmented matrix is used to evaluate (12.28) as a
straightforward, squared multiple correlation from the symmetric
augmented matrix with the one adjoined column of semiaugmented
correlations of $X_j$, that is, divided by the square root of the initial
diagonal values of the respective variables in the subset. Equation (12.28)
is evaluated for $j = 1, \ldots, NV$, the values just obtained are placed into
the diagonal of the correlation matrix, and the process is repeated until the
communalities converge to stable values.

In empirical problems where the predictor variables are a subset of
the NV variables in the problem, converged communalities are usually
quickly secured, but a few may give somewhat erratic values. If all $NV - 1$
variables were used in the regression, the communalities should all
converge rapidly.

Approximation B and modified approximation B


These approximations are based on the fact that a cluster composite of
collinear variables, say $V_{j1}, V_{j2}, \ldots, V_{js}$, also collinear with $X_j$, can
be used as a surrogate for the domain of the variable $X_j$. Hence the
correlation of $X_j$ and the cluster composite is approximately equal to the
correlation of $X_j$ and its variable domain $D_{X_j}$. Let the cluster
composite be

$$C_j = V_{j1} + V_{j2} + \cdots + V_{js}$$

If the definers of this composite are perfectly collinear, it can be shown that

$$h^2_{X_j} = \frac{\sum_{g<k} \left| r_{X_jV_{jg}}\, r_{X_jV_{jk}} \right|}{\sum_{g<k} \left| r_{V_{jg}V_{jk}} \right|} \tag{12.29}$$

This approximation tends to give values biased downward because the
cluster composite is based on only approximately collinear variables. To
obtain a less biased estimate this approximation is modified by averaging
the value obtained from (12.29) and the value of the highest correlation for
any of the $NV - 1$ variables correlated with $X_j$. This average is called the
"modified approximation B."


Triads


The triads method of calculating an approximation to the communality is
equivalent to approximation B except that only two variables, say $V_{j1}$ and
$V_{j2}$, are used to form the cluster composite $C_j$ in finding the communality
of the variable $X_j$.
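The three related estimates just described can be sketched in a few lines. The form used for approximation B here, a ratio of summed absolute products to summed absolute intercorrelations over pairs of collinear variables, is a reading of (12.29) and should be treated as an assumption; with two variables it reduces to the familiar triad ratio. The function names are hypothetical, and the choice of the collinear subset is left to the caller.

```python
def approximation_b(R, x, subset):
    """Communality of variable x from pairs of variables in a collinear subset."""
    numerator = 0.0
    denominator = 0.0
    for a in range(len(subset)):
        for b in range(a + 1, len(subset)):
            g, k = subset[a], subset[b]
            numerator += abs(R[x][g] * R[x][k])
            denominator += abs(R[g][k])
    return numerator / denominator

def triad(R, x, first, second):
    """Triads method: approximation B with exactly two collinear variables."""
    return approximation_b(R, x, [first, second])

def modified_approximation_b(R, x, subset):
    """Average of approximation B and the highest correlation of x with any other variable."""
    highest = max(abs(R[x][k]) for k in range(len(R)) if k != x)
    return 0.5 * (approximation_b(R, x, subset) + highest)

# Four perfectly collinear variables: one common factor, loadings .8, .6, .5, .4.
a = [0.8, 0.6, 0.5, 0.4]
R = [[1.0 if j == k else a[j] * a[k] for k in range(4)] for j in range(4)]
print(approximation_b(R, 0, [1, 2, 3]))           # recovers .8 squared
print(triad(R, 0, 1, 2))                          # the same value from a triad
print(modified_approximation_b(R, 0, [1, 2, 3]))  # average with the highest correlation, .48
```

With perfectly collinear data the unmodified estimates recover the communality exactly, so the modification can only move the estimate away from it; its purpose is to offset the downward bias that arises with real, approximately collinear variables.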


Proportional fit, PF 


This method of calculating communalities is similar to the QF method and
approximation B with respect to the use of a subset of variables that
are relatively collinear with the variable for which the communality is to
be calculated. A set of variables most collinear with $X_j$ (up to nine, under
control of the user of BC TRY) are selected; these variables are called the
"reference variables" for $X_j$. The submatrix of intercorrelations among
these variables is formed, and the diagonal is filled with the highest
correlation that the respective variables exhibit with the other $NV - 1$
variables in the sample. This matrix is adjoined with the vector of
correlations of $X_j$ with the variables in the matrix. The linear regression
equation of the correlations in the adjoined vector on the correlations in
the matrix is solved, giving rise to regression weights for each of the
reference variables. These regression weights are applied to the respective
correlations between $X_j$ and the reference variables. The resulting value is
the best estimate of the diagonal element for the variable $X_j$, with respect
to the degree that the intercorrelation matrix of the reference variables
predicts the intercorrelations between the reference variables and variable
$X_j$. The fact that the reference variables are the most collinear with $X_j$
tends to ensure that the regression-predicted value is proportionally the
best estimate of the diagonal element. Although this method is not in an
obvious way a direct calculation of communalities, it produces diagonal
elements that are consistent with the pattern of intercorrelations in the
rest of the correlation matrix.

The PF method is an indirect solution. First the diagonal elements
in the matrix are filled with preliminary values. Then new values are solved
for, for each variable in the matrix, using the PF algorithm. The new values
are entered as the diagonal elements in the matrix and the process
repeated until the diagonal elements fail to differ on successive iterations
or a number (determined by the user) of iterations are achieved. A
maximum of 10 iterations is permitted by the PF routine in DVP.


Independent dimensional analysis 


The independent successive dimensions from factoring give rise to
communalities. When the factoring procedure is effective, $K$ dimensions
will represent the variability shared by the behavior properties of the
observed variables. If these $K$ independent dimensions are held constant,
the correlations of the observed variables will be essentially zero. The
correlation of a variable $X_j$ with the $K$ independent dimensions or
domains, when squared and summed, is equal to the correlation of $X_j$ and
another hypothetical variable with the same magnitudes of correlations,
that is, $X_j'$

$$r_{X_jX_j'} = r^2_{X_jD_1} + r^2_{X_jD_2} + \cdots + r^2_{X_jD_K} \tag{12.30}$$

A definition of communality alternate to the definition given in (12.27) is
implied in (12.30). It can be shown that the communality is the correlation
between a variable $X_j$ and another collinear variable that has exactly the
same magnitudes of correlation with the other $NV - 1$ variables in the
sample, i.e.,

$$h^2_{X_j} = r_{X_jX_j'} \tag{12.31}$$

where $X_j'$ has the same magnitudes of correlations with the other variables
as those exhibited by $X_j$.
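In computational terms, (12.30) and (12.31) say that the communality of a variable is the sum of its squared correlations with the $K$ independent dimensions. The sketch below shows that sum, together with the correlation between two variables that the same dimensions reproduce. Function names and loadings are invented for the example.

```python
def communality_from_dimensions(coefficients):
    """Eqs. (12.30)-(12.31): sum of squared correlations with the K independent dimensions."""
    return sum(a * a for a in coefficients)

def reproduced_correlation(coefficients_j, coefficients_k):
    """Correlation between two variables implied by K independent dimensions."""
    return sum(a * b for a, b in zip(coefficients_j, coefficients_k))

# Two variables described by K = 2 independent dimensions (illustrative values).
x_j = [0.6, 0.3]
x_k = [0.5, 0.4]
print(communality_from_dimensions(x_j))      # .36 + .09
print(reproduced_correlation(x_j, x_k))      # .30 + .12
```

A hypothetical variable $X_j'$ with the same coefficients as $X_j$ would have a reproduced correlation with $X_j$ equal to the communality, which is the content of (12.31).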
The roles of factoring in calculating communalities and of
communalities in factoring are encountered again in the next section.


Ad hoc diagonal values 


An ad hoc estimate of communality is one that is not directly formulated
on a basic psychometric model of communality. Several such definitions
of diagonal values are in wide use because of empirical success with the
definition in factoring or because of some rational bearing on the definition
of communality. The PF method of calculating communalities is an example
of an ad hoc method. One other ad hoc definition of diagonal values
is in wide use in factor analysis: each diagonal value is set to 1.00. The
practice of setting diagonal values to 1.00 is widely accepted in orthodox
factor analysis. For principal-axes factor analysis the 1.00 values in diagonals
may produce the most adequate results. In cluster analysis, as implemented
in BC TRY, the diagonal values should not be set to 1.00 as this will
generally result in an attempt to define too many clusters. The advantages and
disadvantages and special implications of diagonal values of 1.00 in factor
analysis are discussed extensively in other sources. Another frequently
used ad hoc estimate of the communalities of a variable is the highest
correlation of the variable with the other NV − 1 variables.
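
The last estimate mentioned above is simple enough to state in a few lines.
The following is a minimal sketch in Python, not BC TRY code: the function
name and the small matrix are hypothetical, and taking the absolute value of
each correlation is an assumption, since the text speaks only of the
"highest correlation."

```python
def highest_correlation_diagonals(r):
    """Ad hoc diagonal values: for each variable, its largest absolute
    correlation with the other NV - 1 variables (absolute value assumed)."""
    nv = len(r)
    return [max(abs(r[i][k]) for k in range(nv) if k != i) for i in range(nv)]


# Hypothetical 3-variable correlation matrix, used only for illustration.
R = [
    [1.00, 0.60, 0.30],
    [0.60, 1.00, -0.70],
    [0.30, -0.70, 1.00],
]
diagonals = highest_correlation_diagonals(R)
print(diagonals)
```

The values returned would replace the 1.00 entries on the diagonal before
factoring begins.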


Collinearity in diagonal values routines 


In several of the methods of calculating communalities, subsets of collinear
variables must be selected out of the NV variables sampled. In general,
collinearity is defined in BC TRY by large values of an index of
proportionality defined and discussed in the following section. In the cluster
analysis program CC5 of BC TRY a complicated algorithm is used to select
collinear subsets. In DVP, only the collinearity, defined by the index of
proportionality, between the variable $X_i$ and the other NV − 1 variables is
used to select the subset.


The BC TRY component DVP 


The purpose of DVP is to apply the equations and algorithms discussed 
above to the matrix of correlations. In addition to the communality esti- 
mates, the component prints the 10 highest correlations of each variable. 


A complete list of the IST input and output of the component is given in 
Table 12.2. 


Cluster analysis 


The theory and techniques of key-cluster analysis are discussed in exten- 
sive detail in Chap. 6. This section is a more summary presentation of the 
step-by-step procedures, equations, and algorithms of the BC TRY pro- 
cedures of key-cluster analysis, as expressed in the component programs 
CC5 and NC2. The options under which the programs are run are also 
spelled out here in a general way. The cluster structure analysis procedures 
and models are applicable in general to the theory of cluster analysis. The 
definitions of a domain, a cluster, correlations of variables with clusters, 
etc., all pertain to cluster analysis as well as cluster structure analysis. In 
the BC TRY System the general practice, in empirical research, is to 
cluster-analyze the multivariate data of the research by an application of 
the cluster analysis programs before doing a cluster structure analysis. 
In BC TRY the cluster analysis programs are not designed to evaluate 
cluster structure but more simply to select clusters on the basis of a 
matrix of successively reduced correlations, each successive matrix being
uncorrelated with the previous clusters. In order to do this, each succes- 
sively selected cluster is used to define a fictional “dimension” or factor. 
The correlations of each variable with the fictional dimensions are calcu- 
lated at each step of the clustering process. The properties of these 
factor dimension correlations permit evaluation of the degree to which the 
clusters account for the mean squared correlation coefficient and the 
initial estimate of communality. Cluster selection is the primary purpose 
of the BC TRY cluster analysis programs. As a secondary goal, CC5
iterates the calculation of the correlations of variables with the factor
dimensions in order to reestimate the communalities of the variables until
the estimates converge. During cluster selection and after convergence,
the program calculates and prints a variety of statistics giving the
correlational structure of the factor dimensions.


Steps in selecting a collinear set of variables 


Cluster selection in BC TRY components CC5 and NC2 is either by inputting
the cluster members (preset) as part of the control and data cards or by
collinear subset analysis. In selecting a collinear subset of variables to
define a key-cluster dimension, the programs use the initial correlation
matrix for the first cluster and residual correlation matrices for each
successive cluster. The description in this section uses notation relevant
to the initial correlation matrix.

First, a pivot variable is selected. The criterion for pivot variable
selection is intended to ensure that the pivot variable be most likely to
appear at the "edge" of the multivariate space and be surrounded by
a cluster of other variables. Such a variable would have relatively high
correlations with its most collinear subset and low correlations with variables
at remote points in the configuration. That is, it would tend to have a
large variance of its correlations with the other NV − 1 variables. In
order to avoid the effects of negative signs on the correlations, the
squares of the correlations are used; the variance of the squared
correlations of a variable in the entire set of variables is the index of
pivotness used

$$\sigma^2_i = \frac{1}{NV}\sum_{k=1}^{NV}\left(r^2_{X_iX_k}\right)^2 - \left(\frac{1}{NV}\sum_{k=1}^{NV} r^2_{X_iX_k}\right)^2 \tag{12.32}$$

The selected pivot variable for cluster j is the variable with the largest
$\sigma^2$. This variable becomes $V_{j1}$ in the cluster.


Next, at least one additional definer is added to the cluster, as
follows. The second variable in the cluster, $V_{j2}$, is that variable with
the highest index of proportionality $P^2$ with the pivot variable $V_{j1}$.
The variable $X_i$, the third variable in the cluster, $V_{j3}$, is added when
the proportionality indexes of it with the previous two, that is,
$P^2_{V_{j1}X_i}$ and $P^2_{V_{j2}X_i}$, are greater than .40 and it
has the highest mean index of proportionality with the two variables
already included. The same criterion is applied to the fourth variable
of the cluster except that the average index is found for the previous
three variables. Additional definers are included only if their mean index
of proportionality is within twice the range of the indexes of
proportionality among the four first-selected variables and if all of the
indexes of proportionality of the variable and the previously selected
variables are greater than .81 (an arbitrary criterion that can be modified
by the program user). This combination of criteria is intended to select a
larger number of variables for a cluster if the cluster is extended only by
adding definers relatively collinear with the definers already selected.
Also, when four variables are selected, the criterion for membership is
tightened to permit extension of the cluster membership only when the
additional variables are highly collinear with the already selected cluster
definers: this permits definition of larger clusters only when the cluster
is very "tightly" defined in the configuration.

If a pivot variable fails to pick up an additional definer, a trial-and- 
error procedure is utilized. A second pivot variable is tried, the variable 
with the second highest variance of squared correlations. Four (or another 
number specified by the user) such pivots are tried. If none picks up the 
additional definer, factoring is terminated. 
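
The pivot search described above can be sketched as follows. This is an
illustrative Python fragment, not the CC5 source: the function names, the
example matrix, and the use of the population form of the variance (dividing
by the number of correlations) are assumptions. The diagonal entry is
included by default, as the text indicates for CC5; the `include_diagonal`
flag is a hypothetical way of representing the NC2 case.

```python
def pivotness(r, i, include_diagonal=True):
    """Variance of the squared correlations in row i of the matrix r."""
    squares = [r[i][k] ** 2 for k in range(len(r))
               if include_diagonal or k != i]
    mean = sum(squares) / len(squares)
    return sum((s - mean) ** 2 for s in squares) / len(squares)


def pivot_candidates(r, tries=4):
    """Indices of the trial pivots, highest index of pivotness first."""
    order = sorted(range(len(r)), key=lambda i: pivotness(r, i), reverse=True)
    return order[:tries]


# Hypothetical matrix with communality estimates on the diagonal.
R = [
    [0.9, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.3, 0.3],
    [0.0, 0.3, 0.5, 0.3],
    [0.0, 0.3, 0.3, 0.5],
]
candidates = pivot_candidates(R)
print(candidates)
```

Variable 0 has two very high and two zero correlations, hence the largest
variance of squared correlations, so it is tried first; the later entries of
the list are the fallback pivots used when an earlier one fails to pick up a
definer.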

The collinearity criterion of cluster selection is based on an index
of proportionality $P^2$. The justification of this is as follows. We seek a
subset of variables to define a cluster, such that the correlations of the
variables with the cluster dimension can be used to reproduce the
intercorrelations of the variables in the clusters. This would mean that the
cluster dimension $C_j$, treated as a simple variable, would interact with
two definers of the cluster so as to make the partial correlation of the
definers, conditional on the cluster dimension, have a value of zero

$$r_{X_iX_j \cdot C_j} = \frac{r_{X_iX_j} - r_{X_iC_j}\, r_{X_jC_j}}{\sqrt{\left(1 - r^2_{X_iC_j}\right)\left(1 - r^2_{X_jC_j}\right)}} = .00$$

whence

$$r_{X_iX_j} = r_{X_iC_j}\, r_{X_jC_j} \tag{12.33}$$

Any two members of the subset of variables defining the collinear cluster
have proportional correlations. This is given by taking the ratio of the
relations expressed in (12.33) for two variables in the cluster $C_j$, say
$X_i$ and $X_j$, each paired with a third variable $X_k$:

$$\frac{r_{X_iX_k}}{r_{X_jX_k}} = \frac{r_{X_iC_j}\, r_{X_kC_j}}{r_{X_jC_j}\, r_{X_kC_j}} = \frac{r_{X_iC_j}}{r_{X_jC_j}} = B \tag{12.34}$$

where B is the coefficient of proportionality of the two collinear variables.
The degree to which B varies as a function of k in (12.34) is a measure
of the degree to which the two variables $X_i$ and $X_j$ are not collinear.
This can be measured by the departure from 1.00 by an index called the
"index of proportionality" (see Burt; Tucker, 1951; and Wrigley and
Neuhaus, 1955):

$$P^2_{X_iX_j} = \frac{\left(\sum_{k=1}^{NV} r_{X_iX_k}\, r_{X_jX_k}\right)^2}{\sum_{k=1}^{NV} r^2_{X_iX_k}\; \sum_{k=1}^{NV} r^2_{X_jX_k}} \tag{12.35}$$

For any pair of variables, the closer $P^2$ is to 1.00, the more nearly
collinear they are. The mutual collinearity criterion used in CC5 and NC2 is
simply a matter of selecting a subset of definers whose values of $P^2$ are
as near to 1.00 as possible, beginning with the pivot variable.
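
A short numerical check may help fix the meaning of the index. The sketch
below, in Python, computes the squared ratio of the cross-product of two rows
of a correlation matrix to the product of their sums of squares, which is the
form the index of proportionality takes in the literature cited. The function
name and both example matrices are hypothetical; the first matrix is built
from a single set of assumed loadings so that all rows are exactly
proportional.

```python
def proportionality_index(r, i, j):
    """Squared index of proportionality between rows i and j of r."""
    n = len(r)
    cross = sum(r[i][k] * r[j][k] for k in range(n))
    ss_i = sum(r[i][k] ** 2 for k in range(n))
    ss_j = sum(r[j][k] ** 2 for k in range(n))
    return cross ** 2 / (ss_i * ss_j)


# One-dimension case: r[i][k] = a[i] * a[k], so every pair of rows is
# proportional and the index should be 1.00.
a = [0.9, 0.8, 0.7]
collinear = [[a[i] * a[k] for k in range(3)] for i in range(3)]
p_collinear = proportionality_index(collinear, 0, 1)

# A matrix whose first and third rows are far from proportional.
spread = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
p_spread = proportionality_index(spread, 0, 2)
print(p_collinear, p_spread)
```

In the second matrix the only shared nonzero column is the middle one, so the
index falls to .04, far below the .40 and .81 thresholds mentioned earlier.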

The procedures for calculating the index of pivotness and the index
of proportionality in NC2 differ from those in CC5 in that there are no
diagonal elements in the correlation matrix for NC2. Equations (12.32)
and (12.35) are written for the CC5 procedures and must be modified to
represent the NC2 procedures.

In CC5 it is possible to indicate the number of variables that will
compose each cluster. A special application of this capability is to set this
number $s_i$ for the ith cluster to 1. Each cluster is thus defined by the
pivot variable. The analysis defined in this way is known as the "square
root method" or the "pivot variable method" of factoring the correlation
matrix. It is possible to select the pivot variable in this application by
other methods than the index of pivotness method described above. The
communality, the sum of squared correlations, or the pivotness index are
all possible selections.

When the user of the cluster analysis programs selects the number
of variables to be in each cluster as the number of variables in the entire
set, the resulting analysis is a centroid factor analysis. This option is
applicable only where NV is no greater than 20. If NV is greater than 20,
a special option permits subsets of 20 to be selected on the basis of the
sum of squared correlations or the variance of the squared correlations;
the 20 variables with the largest sums or largest variances are used to
define a salient centroid in the entire set of variables.


Reflecting the definers of a cluster


A metric by which individuals are ordered on all the definers of a
cluster may not be unidirectional, causing the definers to intercorrelate
negatively. To ensure unidirectionality of the definers they must be
reflected so that their intercorrelations are maximally positive. A
reflection is accomplished simply by multiplying the row
and column for a reflected variable in the correlation matrix by −1. For
small clusters (10 or fewer members) it is practical to inspect the
$2^{s-1}$ (s is the number of variables in the cluster) different patterns
of positive and negative signs that define the possible reflections of the
matrix. The optimal reflection is obtained in this way, giving the maximal
sum of correlation coefficients among the definers of the cluster. Where s
is too large to make this procedure attractive, the variables are reflected
successively until the sum of correlations in the matrix fails to increase
with additional reflection. The reflected variable at each step is that
variable having the largest negative sum of correlations in the submatrix
of the cluster definers.
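
The exhaustive search over sign patterns for a small cluster can be sketched
as follows. This Python fragment is an illustration under stated assumptions,
not the program's routine: the sign of the first definer is held at +1 (which
is why only 2^(s−1) patterns need inspection), the sum is taken over the
off-diagonal correlations, and the function names and the example submatrix
are hypothetical.

```python
from itertools import product


def best_reflection(r):
    """Try every sign pattern with the first definer fixed at +1 and return
    the pattern that maximizes the sum of correlations among the definers."""
    s = len(r)
    best_signs, best_total = None, float("-inf")
    for rest in product((1, -1), repeat=s - 1):
        signs = (1,) + rest
        total = sum(signs[i] * signs[j] * r[i][j]
                    for i in range(s) for j in range(i + 1, s))
        if total > best_total:
            best_signs, best_total = signs, total
    return best_signs, best_total


def reflect(r, signs):
    """Multiply the row and column of each reflected variable by -1."""
    n = len(r)
    return [[signs[i] * signs[j] * r[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical submatrix of three definers; the third is scored in reverse.
sub = [
    [1.0, 0.6, -0.5],
    [0.6, 1.0, -0.4],
    [-0.5, -0.4, 1.0],
]
signs, total = best_reflection(sub)
reflected = reflect(sub, signs)
print(signs, total)
```

Reflecting the third definer turns both negative correlations positive, and
the diagonal is unchanged because each diagonal entry is multiplied by the
same sign twice.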


Factor coefficients 


For each cluster in the cluster analysis a fictional dimension is defined
such that it correlates very highly with the cluster domain with the
restriction that all the fictional dimensions are uncorrelated. These
fictional dimensions are called factors. The factors are defined in the same
succession that the clusters are defined. The first factor is defined by the
first cluster, the second factor by the second cluster, etc. Each successive
factor is used to determine the correlation among the NV variables that is
independent of the factor. This process results in a residual matrix of
correlations after the first factor, independent of the first factor; a
second residual matrix of correlations after the second factor, independent
of the first two factors; etc. The successive factor dimensions are defined
in a cluster analysis by applying Eq. (12.36) to the correlations in the
successive residual matrices. The superscript on the correlation
coefficients in the equation denotes the factor; $r^{(k)}$ denotes the
correlation matrix used in determining the kth factor. In the instance of
k = 1 the original correlations are involved; for k = 2 the correlations are
those obtained by calculating the residual correlations after the first
factor, i.e., independent of the first factor; etc. Denoting the K factors by
$F_1, F_2, \ldots, F_K$, the equation for the correlation between a variable
$X_i$ and the jth factor is

$$r_{X_iF_j} = \frac{\sum_{d=1}^{s_j} r^{(j)}_{X_iV_{jd}}}{\sqrt{\sum_{d=1}^{s_j}\sum_{e=1}^{s_j} r^{(j)}_{V_{jd}V_{je}}}} \tag{12.36}$$

where $V_{j1}, \ldots, V_{js_j}$ are the $s_j$ definers of the jth cluster.


Since the K factors are independent, the factor correlation variances
are additive and the sum of the squared factor correlations of a variable,
summing across all factors, is the communality of the variable

$$h^2_{X_i} = \sum_{j=1}^{K} r^2_{X_iF_j} \tag{12.37}$$

Each of the terms in this sum is known as the “partial communality” of
the variable on the factor, that is, $r^2_{X_iF_j}$ is the partial
communality of the ith variable on the jth factor.

For the purpose of stating the percentage of the communality of
a variable determined by the successive dimensions, each of the squared
factor correlations (12.36) is divided by the total communality (12.37). The
square root of these values is called an “augmented or normalized factor
coefficient.” They are calculated directly by dividing each factor coefficient
(12.36) by the square root of the total communality (12.37)

$$r_{D_{X_i}F_j} = \frac{r_{X_iF_j}}{h_{X_i}} \tag{12.38}$$


The sum of squares of these coefficients, for each variable i =1,..., 
NV, is equal to 1.00. The implication is that each variable is mapped onto 
the surface of a unit hypersphere of K dimensions. The coordinates of 
these surface points are used in the geometric description of the cluster 
structure in the BC TRY component SPAN. 
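
The arithmetic of the partial communalities and the augmented coefficients
can be checked on a single variable. The following Python sketch is
illustrative only; the three factor coefficients are made-up values and the
function names are hypothetical.

```python
import math


def communality(coefficients):
    """Sum of the squared factor coefficients of one variable."""
    return sum(c * c for c in coefficients)


def augmented(coefficients):
    """Each factor coefficient divided by the square root of the communality."""
    h = math.sqrt(communality(coefficients))
    return [c / h for c in coefficients]


# Hypothetical coefficients of one variable on K = 3 factors.
coefficients = [0.6, 0.3, 0.2]
h2 = communality(coefficients)               # .36 + .09 + .04
partial = [c * c for c in coefficients]      # partial communalities
aug = augmented(coefficients)
unit_length = sum(c * c for c in aug)        # should equal 1.00
print(h2, aug, unit_length)
```

Because the squared augmented coefficients sum to 1.00, the three values in
`aug` can be read directly as coordinates of a point on the unit sphere.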

In noncommunality cluster analysis, component NC2, the basic
calculations are similar to those of CC5 with the exception that there is no
diagonal element in the correlation matrix. The average intercorrelation
instead of the sum of intercorrelations is used in (12.37).


Residual and reproduced correlations 


The so-called “fundamental factor theorem” states that the correlation
of two variables can be reproduced by the inner product of the factor
coefficients of the two variables if sufficient factors are defined in the
factor analysis of the variables. Since, in general, only K factors are
defined, only a certain portion of all the intercorrelations among the
observed variables is reproduced. The reproduced correlation of $X_i$ and
$X_j$ is given by

$$\hat{r}_{X_iX_j} = \sum_{k=1}^{K} r_{X_iF_k}\, r_{X_jF_k} \tag{12.39}$$

In the general case these coefficients are not equal to the original
coefficients. The differences between the original correlations and the
reproduced correlations are called the “residual correlations.” In the
process of factoring the correlation matrix, i.e., of calculating the
successive factor dimensions, $F_1, F_2, \ldots, F_K$, the amount of the
original correlations reproduced increases with each successive factor.
Thus, there is a succession of residual and reproduced correlation matrices.
The residual correlation between $X_i$ and $X_j$ after $F_k$ has been
accounted for in the factoring is

$$r^{(k+1)}_{X_iX_j} = r_{X_iX_j} - \sum_{l=1}^{k} r_{X_iF_l}\, r_{X_jF_l} \tag{12.40}$$

It is the matrix of these residuals that is used to define the k + 1 cluster
and factor $F_{k+1}$. When the residuals are sufficiently small to satisfy some
criterion of saliency of correlation, the factoring process is stopped and
K = k is accepted as the salient dimensionality of the multivariate data.
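
One step of this factoring cycle can be sketched in Python. The fragment is
an illustration under assumptions, not the BC TRY routine: the factor
coefficient of each variable is taken as the sum of its correlations with the
cluster definers divided by the square root of the sum of all correlations
among the definers (diagonal included), and the residual is the original
correlation less the product of the two coefficients. Function names and the
example matrix, which is built from a single set of assumed loadings, are
hypothetical.

```python
import math


def factor_coefficients(r, definers):
    """Correlation of every variable with the factor defined by `definers`."""
    root = math.sqrt(sum(r[d][e] for d in definers for e in definers))
    return [sum(row[d] for d in definers) / root for row in r]


def residual_matrix(r, coefficients):
    """Correlations remaining after the factor has been removed."""
    n = len(r)
    return [[r[i][j] - coefficients[i] * coefficients[j] for j in range(n)]
            for i in range(n)]


# One-dimension data: r[i][j] = a[i] * a[j], communalities on the diagonal.
a = [0.9, 0.8, 0.7]
R = [[a[i] * a[j] for j in range(3)] for i in range(3)]

coefficients = factor_coefficients(R, definers=[0, 1])
residuals = residual_matrix(R, coefficients)
largest_residual = max(abs(x) for row in residuals for x in row)
print(coefficients, largest_residual)
```

With data of exactly one dimension, a single factor recovers the assumed
loadings and leaves residuals of essentially zero, so factoring would stop at
K = 1.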


Terminating factoring 


Two criteria for termination of factoring are built into the cluster programs 
of BC TRY. When the total squared correlations among the variables are 
approximated by the total squared reproduced correlations based on k 
factors, factoring is terminated. The degree of approximation is selected 
by the user of the system to represent what he thinks a salient amount of 
the correlation matrix is. The successive factors also account for a
successively larger amount of the initial estimates of communality. When the
total communality accounted for by k factors approximates total initial 
communality by some salient degree, the factoring process is terminated. 
When one or another of the following quantities is close enough to zero 
to satisfy the saliency criterion, factoring is terminated 


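$$\sum_{i}\sum_{j} r^2_{X_iX_j} - \sum_{i}\sum_{j}\Bigl(\sum_{l=1}^{k} r_{X_iF_l}\, r_{X_jF_l}\Bigr)^2 \tag{12.41}$$

$$\sum_{i} h^2_{X_i} - \sum_{i}\sum_{l=1}^{k} r^2_{X_iF_l} \tag{12.42}$$

where, in (12.42), $h^2_{X_i}$ is the initial estimate of the communality of
$X_i$.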


Iteration for convergence on communalities


In communality cluster analysis the communality estimated on the basis
of the factoring is generally quite different from the communality initially
estimated. If the initial estimates are replaced by the estimates from
factoring and new factor coefficients calculated, the communalities
estimated from the second factoring will be still different. In order to have
estimates of communalities entirely consistent, the process of calculating
factor correlations is repeated a number of times until the difference
between communality estimates on successive calculations is essentially
zero. The process is called “iteration to convergence on communalities.”
BC TRY component CC5 permits convergence to a desired criterion of
convergence or as a standard option four iterations, after which the
differences are quite likely to be small.
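
The iteration can be illustrated with a deliberately small case. The Python
sketch below makes several simplifying assumptions that are not taken from
the program: only one factor is extracted, its definers are preset, the
initial diagonal values are the highest correlations in each row, and the
tolerance and iteration limit are arbitrary. All names are hypothetical.

```python
import math


def factor_once(r, definers):
    """Coefficients of all variables on the factor defined by `definers`."""
    root = math.sqrt(sum(r[d][e] for d in definers for e in definers))
    return [sum(row[d] for d in definers) / root for row in r]


def iterate_communalities(r, definers, tol=1e-10, max_iter=500):
    """Refactor repeatedly, replacing the diagonal by the communalities
    obtained from the previous factoring, until they stop changing."""
    work = [row[:] for row in r]          # leave the caller's matrix intact
    for iteration in range(1, max_iter + 1):
        coefficients = factor_once(work, definers)
        new_h2 = [c * c for c in coefficients]
        change = max(abs(h - work[i][i]) for i, h in enumerate(new_h2))
        for i, h in enumerate(new_h2):
            work[i][i] = h
        if change < tol:
            break
    return new_h2, iteration


# Off-diagonal correlations generated by assumed loadings .9, .8, .7;
# the diagonal starts at the highest correlation in each row.
a = [0.9, 0.8, 0.7]
R = [[a[i] * a[j] for j in range(3)] for i in range(3)]
for i in range(3):
    R[i][i] = max(abs(R[i][k]) for k in range(3) if k != i)

h2, iterations = iterate_communalities(R, definers=[0, 1, 2])
print(h2, iterations)
```

In this example the estimates settle on the squared loadings (.81, .64, .49).
With several factors the communality at each pass would instead be the sum of
the squared coefficients over all factors.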


The BC TRY component CC5


Options in the component permit the application of a wide variety of
analyses, among which are:


1 Empirical key-cluster analysis. The program selects the definers of the K
clusters and determines the number of clusters.

2 Preset key-cluster analysis. This form of cluster analysis has been known by
factor analysts as “multiple group factoring.” The user of the program inputs the
definers, selected on rational grounds or on the basis of a previous empirical
key-cluster analysis.

3 Preset dimension analysis. The user lets the component select the clusters
but indicates on the control cards the number of clusters desired.

4 Pivot variable analysis. Each dimension is defined by one variable, selected
in the way pivot variables for empirical cluster analysis are selected. This mode of
factoring is called “square root factoring” in the literature of factor analysis.

5 Centroid factor analysis. In standard centroid factor analysis the factors are
defined on all NV variables. In centroid factor analysis in the CC5 component, the
factors are defined on the 20 variables with the largest sum of squared correlations.
If NV is equal to or less than 20, the analysis is a standard centroid analysis; otherwise
it is referred to as a salient centroid analysis.

6 Bifactor analysis. The first dimension is defined by a general or salient
centroid factor analysis, and the remaining dimensions are defined by key-cluster
analysis, either empirical or preset. The use of CC5 in this analysis is repeated. The
first use of CC5 involves the use of the centroid factor analysis option and a preset of
the number of dimensions at one. The program FAST in the system is then used to
replace the original correlation matrix with the residual matrix, communalities are
reestimated by using the component DVP, and then CC5 is used to perform the
key-cluster analysis.

7 Sleeper analysis. Before using CC5 the analyst uses the program SLEP1 to
delete temporarily from the correlation matrix a number of variables he does not wish
to be involved in the cluster factoring procedures. Once the cluster analysis is
completed, the user calls SLEP2, which recalculates the factor coefficients and other
statistics from CC5 on all the original set of variables, including those that were
sleepers in the CC5 analysis. This design requires a sequential use of four components,
SLEP1, DVP, CC5, and SLEP2.

8 Ordered and designed analyses. By sequencing the use of components in
the BC TRY System before and after CC5, the cluster analysis program can be used to
achieve a very wide spectrum of analytic designs.


The input and output of the component program to and from the IST
are listed completely in Table 12.2. In addition to these files a large number
of statistics and lists are output on the printer. At the beginning of the
analysis the following information is output: descriptions of the methods
and parameters selected by the user, mean squares of the original
correlations, initial communality estimates of all the variables, and the
sum of initial estimates of communalities. At the end of the last iteration
of factoring the following are output as a summary of the analysis: lists
of cluster definers, the factor coefficients of all variables on each factor
dimension, the partial communalities, the cumulative communalities, the
augmented factor coefficients, the squared factor coefficients, the
proportion of the sum of squared correlations accounted for by the K
dimensions, the proportion of the sum of communalities exhausted by
the K dimensions, the mean of the squares of residuals after each
dimension, and the matrix of residual correlations after the last dimension.
For the first iteration of factoring the printout includes: sum of squared
correlations or residual correlations for each variable, the variance of
squared correlations, the index of proportionality $P^2$ of each definer of
the dimension with each of the NV variables, the submatrix of correlations
between the defining variables, the list of variables defining the cluster,
the factor coefficients of all of the variables, the partial communality of
the variables, the cumulative communality of the variables, the proportion of
the communality and sum of squared correlations exhausted by the factors
so far defined, and the sum of squares of the residual correlations.


The BC TRY component NC2 


... elements implies that a pivot ... communality exhaustion criterion
is not a particularly meaningful criterion although it can be used. The
solution is not iterated since the communality values are not involved in
defining the factor dimensions.


The spherical configuration


The factored dimensions, being derived from residuals that are values
obtained by holding prior dimensions constant, are independent
(orthogonal) axes. The factor coefficients of a variable are coordinates of
the variable on these axes. Plotting of the variables by their coordinates
describes the configuration of points (variables), the structure of the
relationships among all the variables.

If the variables fall in independent, uncorrelated clusters, the 
dimensions derived by key-cluster factoring usually will pass through them. 
But clusters of variables are usually correlated (oblique); the key-cluster 
dimensions usually come near to the oblique clusters, subject to the 
condition that the dimensions are orthogonal. 

Efforts to interpret orthogonal dimensions of any sort are usually
ill directed since they are merely an orthogonal “reference frame” that
“holds” the configuration. To rotate the reference frame does not change
the configuration; the axes, lying in new positions, e.g., varimax or
quartimax axes, are not likely to take on meaning by virtue of being rotated
because they are still orthogonal whereas the structure is usually oblique.
In short, the configuration is the thing; the independent cartesian
dimensions that frame it are rarely of real interest.

We need to represent visually the structural pattern as a whole,
the total configuration. To do so is the object of spherical analysis,
programmed by the component SPAN2. The total configuration would be
invariant with respect to the method of factoring if, by whatever method, we
factored through to NV dimensions. But we do not: we factor on k < NV
dimensions. But a distinctive feature of factoring is that, as factoring
proceeds, additional dimensions describe a configuration that departs
less and less from the total configuration. Thus a set of K salient
dimensions describes the salient features of the configuration; additional
K + 1, K + 2, etc., dimensions trivially alter the configuration.
Key-cluster factoring shares with other factoring methods this attribute, but
the configuration is stable even if we add more key-cluster dimensions.
Rotated centroid or principal-axes (FALS) dimensions may present quite
different aspects of the configuration as additional trivial dimensions are
added and rotated, particularly if the varimax rotation method is used.

The main point here, however, is that graphic presentation of the
salient configuration is extremely important in structure analysis. For
when the configuration is visually examined, arbitrary features in cluster
selection by the factoring method become immediately apparent. The
investigator will study the configuration, revise the cluster selection if
necessary, and rerun the whole analysis with preset definers of the
dimension.

We can summarize the objectives of configuration analysis as follows:
(1) to select the particular properties of a geometric model that provide
perceptually the best layout of the total salient configuration; (2) to
select the minimal subspaces of the K dimensions by which to present
the salient configuration (SPAN2); (3) to select the particular type of
factored dimension that presents aspects of the configuration most
invariant to the adding of dimensions (GYRO); (4) to select from the
salient configuration the final set of least oblique, most collinear clusters
by which to score individuals (objects).


Selection of spatial properties to 
depict the salient configuration 


The geometric model on which the configuration is presented is the
generalized unit sphere. This model relies on the fact that any variable
$X_i$ can be plotted as a point on such a sphere since the sum of its squared
augmented factor coefficients (taken as coordinates) equals 1.00, that is,

$$r^2_{D_{X_i}F_1} + r^2_{D_{X_i}F_2} + \cdots + r^2_{D_{X_i}F_K} = 1.00 \tag{12.43}$$

As a surface point, variable $X_i$ has an augmented communality of 1.00,
(12.38), whence the point refers to the domain score for the variable $X_i$.

To clarify this logic, take an example of two variables, $X_i$ and $X_j$,
all of whose geometric properties in the spherical model are depicted
in Fig. 12.1. Within this two-dimensional subspace of the full solution the
reproduced correlation is .429, from Eq. (12.39). The interdomain
correlation (augmented correlation) for the two variable domains is .507, from

$$r_{D_{X_i}D_{X_j}} = r_{D_{X_i}F_1}\, r_{D_{X_j}F_1} + \cdots + r_{D_{X_i}F_K}\, r_{D_{X_j}F_K} \tag{12.44}$$

By definition of the correlation (12.44) and analytic geometry of a sphere,
$r_{D_{X_i}D_{X_j}}$ is the cosine of the central angle $\theta$ of the
variables. Here $\theta$ = 60°.

The communalities, from factoring, have the values $h_i^2$ = .88 and
$h_j^2$ = .81. Since the observed variable $X_i$ is by definition collinear
with $D_{X_i}$, the raw score variable $X_i$ is located at distance $h_i$ from
the origin and $D_{X_i}$ at distance 1.00 on the surface of the sphere. The
correlation $r_{X_iD_{X_i}}$ is the same as the square root of the
communality, $h_i$.

Thus, within a three-dimensional sphere defined by factored dimensions
$F_1$, $F_2$, $F_3$, the plane in Fig. 12.1 is a slice through the origin, on
the surface of which lie domains $D_{X_i}$ and $D_{X_j}$ subtended by a 60°
angle from the origin and with the fallible (observed) variables on their
vectors as shown. Any two variables can be represented geometrically in this
manner whatever the dimensionality of the solution.
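
The figures quoted in this example can be checked against one another. The
short Python sketch below assumes only what the preceding equations imply:
since each augmented coefficient is a factor coefficient divided by the
square root of the communality, the reproduced correlation equals the
interdomain correlation multiplied by the two square roots of the
communalities. The variable names are hypothetical.

```python
import math

h2_i, h2_j = 0.88, 0.81      # communalities given in the text
r_domains = 0.507            # interdomain (augmented) correlation

# Central angle between the two domains on the sphere.
theta_degrees = math.degrees(math.acos(r_domains))

# Reproduced correlation of the fallible variables: each variable lies on
# its domain vector at distance h from the origin.
r_reproduced = math.sqrt(h2_i) * math.sqrt(h2_j) * r_domains

print(round(theta_degrees, 1), round(r_reproduced, 3))
```

The angle comes out a little under 60 degrees and the reproduced correlation
close to .43, which agrees with the quoted .429 within the rounding of the
published values.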

Note the other geometric properties of the relations between $X_i$
and $X_j$ in Fig. 12.1. Most important is the distance between $D_{X_i}$ and
$D_{X_j}$ and the distance between $X_i$ and $X_j$. The actual value of
$r_{D_{X_i}D_{X_j}}$ is represented twice, from the perpendicular projections
of $D_{X_i}$ and $D_{X_j}$ on the two vectors, as shown. The correlations
between the variables $X_i$ and $X_j$ and the domains $D_{X_i}$ and $D_{X_j}$
are perpendicular projections from their loci. These are the oblique factor
coefficients, of special interest when the domains are oblique clusters.

FIGURE 12.1
Spherical model of two variables and their domains.


Selection of minimally sufficient subspaces in 
which to describe the configuration (SPAN) 


When the salient factored dimensionality K exceeds three, the total
configuration cannot, as such, be visually presented, since it lies in a
hypersphere. But the hyperspherical configuration can be fractionated
into subspaces in which only those variables are plotted whose augmented
communalities, i.e., the proportion of their communalities, approach 1.00
or any other high optional criterion of sufficiency S. If we let S be less
than 1.00, there usually exists a set of three-dimensional subspaces in
which all or parts of the total configuration can be described. For example,
it is not uncommon to be able to describe the total configuration on a
hypersphere as a configuration in one three-dimensional sphere at a
sufficiency criterion of, say, .80.

The logic of decomposing the hypersphere into subspaces is as
follows. At a criterion S < 1.00, all variables are allocated to
one-dimensional subspaces (single dimensions). The dimension with a maximal
number of variables that meet the criterion is selected first. The dimension
that includes the maximal number of variables not included in the first
is selected second, and so on. The dimensions are thus ordered by
saliency in their accounting for the variables by one-dimensional sufficiency
at level S. Then two-dimensional sufficiency (planes) is tested, in which
all variables are allocated to pairs of dimensions. The plane that includes
the most variables is selected first. Then the plane that includes the
maximal number of variables not included in the first plane is selected,
and so on. Some variables may carry over to two planes. The minimally
sufficient two-dimensional subspaces are thus ordered by saliency. The
same principle is followed for three-dimensional subspaces of larger spaces.
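
The ordering of subspaces by saliency can be sketched as a simple greedy
selection. The Python fragment below is an illustration, not the SPAN
algorithm: a variable is taken to meet the criterion in a subspace when the
sum of its squared augmented coefficients on the dimensions of that subspace
is at least S, ties are broken by the order in which subspaces are
enumerated, and only the variables newly accounted for are listed with each
subspace. The function name and the data are hypothetical.

```python
from itertools import combinations


def sufficient_subspaces(aug, size, s):
    """Pick subspaces of `size` dimensions in order of saliency: first the
    one holding most variables at sufficiency s, then the one holding most
    of the variables not yet placed, and so on."""
    n_dims = len(aug[0])
    remaining = set(range(len(aug)))
    chosen = []
    while remaining:
        best_dims, best_members = None, []
        for dims in combinations(range(n_dims), size):
            members = [v for v in sorted(remaining)
                       if sum(aug[v][d] ** 2 for d in dims) >= s]
            if len(members) > len(best_members):
                best_dims, best_members = dims, members
        if best_dims is None:      # no subspace holds any remaining variable
            break
        chosen.append((best_dims, best_members))
        remaining -= set(best_members)
    return chosen


# Hypothetical augmented coefficients of four variables on K = 4 dimensions.
aug = [
    [0.8, 0.6, 0.0, 0.0],
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
    [0.0, 0.6, 0.8, 0.0],
]
planes = sufficient_subspaces(aug, size=2, s=0.9)
print(planes)
```

Calling the same function with `size=1` or `size=3` gives the corresponding
ordering of single dimensions or of three-dimensional spheres.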

When the dimensionality is three, all the variables can be printed
as loci on the surface of the sphere. When the dimensionality is greater
than three, each sphere includes only variables that meet or exceed S, but
any variable with S < 1.00 lies below the surface, i.e., projects onto
dimensions not in the sphere to the degree 1 − S. Many variables may
carry over from one sphere to another. By simultaneously looking at the
configuration on different spheres, “perceptually superimposing” the
common variables, the investigator can often “see” the total configuration
on the hypersphere.

On any sphere there is an octant formed by the spherical triangle
whose sides intersect at the three factored dimensions of the sphere. If all
variables included in the sphere lie within this octant, they define a
“positive manifold”; i.e., their factor coefficients on all three dimensions
are positive. To maximize the positiveness of the manifold, it may be
necessary to reflect some variables. Procedurally, one selects as the “formal
center” of the octant a point whose factor coefficients on the three
dimensions are equal and the squares of which sum to 1.00. Then all
variables are reflected if by so doing they are nearer the formal center.
Such a procedure concentrates the configuration at one place on the sphere,
thus facilitating the perceptualization of the configuration.
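
The reflection rule just stated can be sketched in a few lines of Python.
The fragment assumes that "nearer" means ordinary straight-line distance in
the three coordinates; the function names and the two example points are
hypothetical.

```python
import math


def distance(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def reflect_toward_center(points):
    """Reverse the signs of any variable that lies nearer the formal center
    of the positive octant after reflection than before."""
    center = [1.0 / math.sqrt(3.0)] * 3    # equal coefficients, squares sum to 1
    result = []
    for p in points:
        flipped = [-x for x in p]
        if distance(flipped, center) < distance(p, center):
            result.append(flipped)
        else:
            result.append(list(p))
    return result


# Hypothetical coefficients of two variables on the three dimensions.
points = [
    [-0.6, -0.6, -0.5],    # lies in the opposite octant, so it is reflected
    [0.7, 0.7, -0.1],      # already near the formal center, so it is kept
]
centered = reflect_toward_center(points)
print(centered)
```

Because the formal center has equal positive coordinates, the test amounts to
reflecting a variable whenever the sum of its three coefficients is negative.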

The configuration can best be visualized if the centroid of all plotted
variables is in the line of the eye. If we call the factored dimensions at the
vertexes of the spherical triangle dimensions $F_a$, $F_b$, $F_c$, then
“centering” the configuration for visual presentation is achieved by “rigid
rotation” of $F_a$, $F_b$, and $F_c$ as follows. Calling the horizontal,
vertical, and the inferred third dimension in the plane of the paper $F_b'$,
$F_c'$, and $F_a'$, respectively, the centroid of the points is rotated to
$F_a'$, and $F_b$ and $F_c$ are rotated to positions $F_b'$, $F_c'$,
respectively, in such a fashion that the angle of rotation of $F_b$ to $F_b'$
equals the angle of rotation $F_c$ to $F_c'$. In the BC TRY component
the $F_a'$, $F_b'$, $F_c'$ coordinates of each included variable are given in
the printout and designated by a check (V or R). The transformation is
performed by a matrix multiplication. The matrix of augmented factor
coefficients, $r_{D_{X_i}F_j}$ (NV by 3), for the three dimensions defining
the sphere is postmultiplied by a matrix T (3 by 3):

$$T = \begin{pmatrix} E(F_a) & -E(F_b) & -E(F_c) \\[4pt] E(F_b) & 1 - \dfrac{E(F_b)^2}{1 + E(F_a)} & -\dfrac{E(F_b)\,E(F_c)}{1 + E(F_a)} \\[8pt] E(F_c) & -\dfrac{E(F_b)\,E(F_c)}{1 + E(F_a)} & 1 - \dfrac{E(F_c)^2}{1 + E(F_a)} \end{pmatrix} \tag{12.45}$$


The elements in T аге 
NV 


1 A A 
NV py 
(Р) = ist 2 : 
1 A NS 
[2e 
j ізі 


ensional space can be represented 
le meeting the sufficiency criterion 


j2abc (12.46) 


The configuration in the three-dim 


by a two-dimensional graph. Each variab 
S in the space is projected perpendicularly onto the plane defined by 


P; and Б), simply plotted as cartesian coordinates. Since each point in the 
Configuration is very close to the surface of the unit sphere (depending 
on 5), the location of the surface point is implied by the location in the 7; 
and F; plane. 
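
Centering and projection can be sketched as follows. The rotation used here is one proper rotation that carries the unit centroid onto the third axis; it is offered as an assumption about the kind of transformation described, not as the component's exact matrix, and the function names are hypothetical.

```python
import numpy as np

def centering_rotation(A):
    """3x3 rotation T such that (unit centroid of the rows of A) @ T = (0, 0, 1)."""
    e = np.asarray(A, dtype=float).mean(axis=0)
    ea, eb, ec = e / np.linalg.norm(e)
    d = 1.0 + ec  # assumes the centroid does not point opposite the third axis
    return np.array([
        [1 - ea * ea / d, -ea * eb / d, ea],
        [-ea * eb / d, 1 - eb * eb / d, eb],
        [-ea, -eb, ec],
    ])

def project(A):
    """Rotate the configuration, then drop the third (line-of-eye) coordinate."""
    rotated = np.asarray(A, dtype=float) @ centering_rotation(A)
    return rotated[:, :2], rotated[:, 2]
```

After rotation the plotted points have zero mean in the plane, so the configuration is centered on the page; the dropped third coordinate shows how far each point sits toward the viewer.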

Selecting factored dimensions that make the salient 
configuration most nearly invariant to added 
dimensionality, and the rotation of axes (GYRO) 


Unaugmented factor coefficients on K dimensions of any variable $X_i$ 
determined by key-cluster factoring do not materially change if one 
continues to factor to K + 1, K + 2, etc. dimensions. Augmenting the 
factor coefficients under added dimensionality alters only those variables 
that pick up added communality from the added dimensions, but this 
added communality is usually small on dimensions added to the salient 
set K. Therefore, the configuration described by K dimensions is usually 
not materially changed from that described by dimensions greater than K. 

The problem is important because the decision on K is arbitrary, 
depending on the predilections of the investigator. Thus, if the configura- 
tion is changed radically because of the arbitrary decision to add more 
dimensions than K, then the presented configuration is sensitively a 
function of an arbitrary decision and becomes untrustworthy. 

Since centroid and principal-axes dimensions are defined by all 
variables, variously weighted by their factor coefficients on the different 
dimensions, they fall not in centers of collinear clusters but centrally to 
all variables. Usually impossible to conceptualize, the dimensions are 
rotated "to meaning." This means placing them in the configuration so 
that they may be better defined by variables, hence taking on meaning 
derived from nearby variables. 

Quartimax rotation (Neuhaus and Wrigley, 1954; see Harman, 1967, 
pp. 294ff) is a rotation that tends to concentrate the factor coefficients 
of each variable on a few dimensions. The net effect is to pass the quarti- 
max dimensions into or near clusters of variables, as key-cluster dimen- 
sions do. Therefore, when the configuration is described by quartimax 
dimensions, it tends, like that described by key-cluster dimensions, to be 
immaterially affected by added, rotated dimensions. 

Varimax rotation (Kaiser, 1958; see Harman, 1967, p. 301) is 
otherwise. This rotation tends to equalize the factor coefficients of all rotated 
dimensions. The consequence is to bound the configuration, i.e., generate 
a positive manifold, with each dimension given about equal weight as a 
bounding intersection of planes. Therefore, by arbitrarily adding additional 
dimensions the factor coefficients on earlier dimensions may radically 
change. The main difficulty is that if the investigator is interested in 
interpreting the dimensions, then by adding dimensions the interpretation 
of earlier dimensions may radically change. For this reason, describing 
the configuration by varimax dimensions seems less desirable than doing 
so by quartimax. 
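
Both rotations belong to the orthomax family and can be tried side by side. The sketch below uses a widely used SVD-based iteration, with `gamma=0` giving quartimax and `gamma=1` giving varimax. It is not the GYRO program, and the function name and defaults are assumptions.

```python
import numpy as np

def orthomax(A, gamma=1.0, max_iter=200, tol=1e-8):
    """Orthogonally rotate loadings A (variables x dimensions); returns (rotated, R)."""
    A = np.asarray(A, dtype=float)
    p, k = A.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient of the orthomax criterion with respect to the rotation.
        G = A.T @ (L ** 3 - (gamma / p) * L * (L ** 2).sum(axis=0))
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt  # nearest orthogonal matrix to the gradient
        d_new = s.sum()
        if d_new < d * (1 + tol):  # criterion has stopped improving
            break
        d = d_new
    return A @ R, R
```

Rotating the same coefficients at K and at K + 1 dimensions with each setting, and comparing the first K columns, is one way to examine the stability claim made above for a particular data set.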


Selection of the final set of oblique 
clusters from the configuration 


With the salient configuration visually presented in SPAN the investigator 
can finally select the oblique clusters of variables by which to score 
individuals by FACS. It is usually found that the key-cluster dimensions 
pass as near to collinear clusters as is possible subject to the condition 
of their being orthogonal. Usually the defining variables of key-cluster 
dimensions are retained as defining variables of the final set of oblique 
clusters selected by the analyst. But not always. Normally, the investigator 
wishes to add some variables to an oblique cluster that had not been 
accepted as definers by the somewhat arbitrary mutual collinearity 
criterion. The "unifactorial" allocation of variables in CSA2 is helpful 
here. Some definers at the edge of a defining collinear set may be rejected. 
Having decided on the final most independent, most collinear oblique 
clusters, the analyst usually reruns the analysis, especially CSA and 
SPAN with preset definers. 


The BC TRY component SPAN 


The component SPAN is a rather straightforward application of the 
techniques just described. Each successive three-space is printed out as a 
map of the cartesian coordinates of $F'_a$ and $F'_b$, along with suitable markings 
in the plot to indicate the original factor dimensions, $F_a$, $F_b$, and $F_c$, the 
surface arc connecting these points, and a scaling. Each variable plotted 
in a three-space is located by a spotter, showing the relative location of 
the variable on the page of the plot. The basic statistics of each of the 
three-spaces needed to account for the variables are printed. These are 
the coordinate values in $F_a$, $F_b$, and $F_c$, the augmented coordinate values, 
and the actual communalities of the variables. The user of the component 
has a wide variety of options regarding the operation of the component. 
For example, the user can indicate on control cards those three-spaces he 
wishes to have printed, or he can have printed certain selected subspaces 
along with the subspaces that the program would print as a matter of 
the logic of the program. The spaces can be defined in terms of the 
unrotated factor coefficients or the factor coefficients obtained by the 
use of GYRO. Subspaces not containing a certain minimal number of 
previously undescribed variables can be suppressed in the printout, 
saving the listing of subspaces with little information beyond that 
contained in other subspaces already printed. The criterion S can be set to 
some nonstandard value (standard is .80), variables with communality 
below a given amount can be deleted from consideration, and the 
transformation can be defined on the cluster definers alone instead of on all 
NV variables. The IST summary in Table 12.2 indicates all the files read 
by the component. SPAN does not output files to the IST or DST. 

The BC TRY component GYRO 


This component is a direct application of the rotation procedures of the 
quartimax method or the varimax method as described by their originators. 
The rotated factor coefficients and related statistics are printed by the 
component. The user specifies which of the methods and how many 
iterations or what convergence accuracy he wishes. 


V-analysis of many variables 


Some investigators face the task of performing a cluster or factor analysis 
of data on a very large number of variables. A seeming difficulty is that 
the BC TRY System has a limitation of NVmax (120 or 90 on systems in 
operation) variables on many of its components. BIGNV surmounts this 
difficulty. The object is to provide systematic procedures by which the 
BC TRY System can be employed to perform a V-analysis on any problem, 
however large the number of its variables may be. 

The objective can be quite simply stated. Imagine that we had a 
super BC TRY System that could perform a V-analysis of an indefinitely 
large number of variables, a system with no limit to the number of variables 
it could handle. Suppose, further, one was actually confronted with a 
problem with 1,000 variables. How could one use the system having only 
a single-run limit of NVmax variables to discover the cluster structure that 
would have been found in one run by the imaginary super BC TRY System? 

The general answer to such a question is as old as statistics. One 
proceeds by the use of sampling procedures. There is, however, a new 
wrinkle to BIGNV beyond mere random sampling. Successive samples 
of NVmax variables are taken in such a fashion that V-analyses on these 
samples increasingly converge upon the results of a hypothetical super 
single V-analysis performed on all the variables. The V-analysis on the 
final sample of NVmax variables is designed, in short, to depict the structure 
of all NV variables, however large NV may be; i.e., it is the solution, we 
would expect, that would have been discovered in one run on all NV 
variables by a hypothetical super BC TRY System. 

Strictly speaking, BIGNV is not a single component program of the 
system; it is a design of analysis that involves the use of a sequence of 
components to achieve the statistical objective. Nevertheless, since the 
sequence stands as a unit it is treated as a component. The principle of 
sample convergence embodied in BIGNV is applied also in O-analysis. 
There are four major stages in BIGNV analysis. 

1 Stage 1: Tentative selection of a stable set of pivotal dimension-defining 
variables. The procedure is to run V-analyses on a small number of 
variables drawn randomly from the NV variables. The samples chosen are sufficient 
to establish the dimensionality K of the NV variables and to select a stable pivotal set 
of defining variables of these dimensions. To do this, a full-cycle CC analysis is 
performed on a first random sample. Next, a second random sample of the variables is 
chosen that includes as markers the most collinear sets of definers observed in the 
first sample plus other variables having high communalities. Then, a CC analysis is 
run on this second sample. Its most collinear pivotal sets of dimension definers may 
or may not include the first pivotal sets; the second run provides an opportunity to add 
to the first pivotal sets, to change them, even to change the dimensionality. Now, a 
third random sample is chosen, including the best pivotal sets plus other high 
communality variables from the first and second run. It is a matter of judgment how many 
such runs with carry-over variables are necessary to select a final stable set of pivotal 
dimension definers; perhaps two or three should be sufficient even in problems with 
a very large NV. 

The carry-over of variables into successive random samples may introduce 
some undesirable cumulative biasing effects. An alternative method of selecting 
dimension definers that seems to reduce, perhaps eliminate, such possible cumula- 
tive bias is to confine the selection to the most salient variables drawn from fully 
random samples, i.e., samples not augmented by any carry-over variables. By most 
salient here is meant the most general variables, as objectively indexed by the value 
of their communalities derived from CC factoring (or from a DVP program) or as 
measured by the mean squares of their correlations or by the variances of their 
squared correlations. 
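
One of the salience indexes named above, the mean square of a variable's correlations, is simple to compute from a correlation matrix. The sketch below is an illustration under the assumption that self-correlations are excluded from the mean; the function name is hypothetical.

```python
import numpy as np

def most_salient(R, n_keep):
    """Rank variables by the mean of their squared correlations with the other variables."""
    R = np.asarray(R, dtype=float)
    sq = R ** 2
    np.fill_diagonal(sq, 0.0)  # leave out each variable's correlation with itself
    mean_sq = sq.sum(axis=1) / (R.shape[0] - 1)
    order = np.argsort(-mean_sq)  # most general variables first
    return order[:n_keep], mean_sq
```

Applied to a fully random sample, the returned indexes give candidate dimension definers without any carry-over from earlier runs.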

2 Stage 2: Construction of a composite configuration of all NV variables. In 
this stage the sets of pivotal clusters from stage 1 are used as common anchor 
dimension definers of a series of preset V-analyses on subsamples of the whole pool of 
all NV variables divided into lots having sizes manageable by the BC TRY 
components, i.e., into lots of NVmax − S, where S is the number of common variables that 
compose the dimension-defining clusters. These lots do not have to be random 
samples; they can be any grouping of the NV variables. The printouts of CSA in these 
lots are strictly comparable and can be concentrated into one composite CSA listing, 
because it is a matter of indifference what lot any particular nondefining variable is in. 
Findings from CSA are governed entirely by relations with the definers and not at all 
by the context of nondefiners. 

The same is true of the spherical configurations on the SPAN spheres of all 
these lots. Across all lots, the configurations on a sphere of a given set of three 
dimensions are strictly comparable. With a little juggling, one can draw a single 
composite sphere that includes all the variables appearing on it in all lots. The composite 
spheres of the problem give an overall composite configuration of all NV variables. 
In drawing the composites one eliminates from consideration variables with low 
communality. 
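
The division into lots can be sketched directly. This is a minimal illustration with a hypothetical function name; it assumes variables are identified by hashable labels and that every lot carries the full set of S definers plus up to NVmax − S other variables.

```python
def make_lots(variables, definers, nv_max):
    """Split the nondefining variables into lots, each prefixed by all the definers."""
    definers = list(definers)
    definer_set = set(definers)
    rest = [v for v in variables if v not in definer_set]
    size = nv_max - len(definers)  # room left in each run for nondefiners
    if size <= 0:
        raise ValueError("nv_max must exceed the number of definers")
    return [definers + rest[i:i + size] for i in range(0, len(rest), size)]
```

With 1,000 variables, 6 definers, and a limit of 120, this yields 9 lots, each small enough for a single preset run.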

3 Stage 3: Sector analysis for the final selection of core V-clusters. With all 
NV variables projected into the same configurations, the final core V-clusters can be 
drawn from all NV variables. They consist, first, of K pivotal dimension-defining 
clusters and, second, of any dependent clusters that appear in the configuration, i.e., 
"natural" clusters that may lie in between the K pivotal clusters. Together the two 
groups form the final clusters that describe the total configuration. 

In stage 3, therefore, one chooses what appears in the composite configuration 
to be the best sets of most nearly independent pivotal dimension definers and, using 
them as new pivots, one reruns the analyses only on samples of variables that are 
most likely to be chosen to be either in final clusters of dimension definers or 
independent clusters. For example, for the first rerun in stage 3 preset on the new revised 
sets of K pivots, one would probably include, in addition to the pivots, only variables 
in the zone in SPAN around the terminus of the first dimension. The second rerun 
might include, along with the common pivots, those in the larger sector embracing 
variables lying in the plane of the first and second dimensions and those also around 
the terminus of the second dimension. Guidance on what sectors to select in the 
runs comes from a study of the general structure of the configuration assessed in 
the light of the objectives to select the most collinear, meaningful, and salient core 
V-clusters. 

At the end of sector analysis the best set of core V-clusters composed of the 
most nearly independent set of pivotal clusters will have been selected. These core 
V-clusters could be the ones on which scores would be computed later by FACS. 

4 Stage 4: The final representative configuration. In most studies there will be 
hardly more than NVmax variables in the final core V-clusters. These can all be run 
in one single V-analysis. In principle, the results should be about what would be found 
if all NV variables had been run together in a one-run super V-analysis. 

In a really massive problem, it may be that more than NVmax variables survive 
to stage 4. If so, then more than one run may be necessary, but since they would be 
preset on the same final pivotal clusters, the results would be strictly comparable and 
hence projected into one composite. 


The BC TRY components for BIGNV 


There are several programs that are used in various combinations for the 
execution of a BIGNV analysis. They are quite complex, and their descrip- 
tion here would serve no useful purpose. The components are available 


only on the IBM 7094 in Fortran II. The procedures described above can be 


approximated, with difficulty, without any special component programs. 


Comparative cluster and factor analysis 


The BC TRY component COMP is designed primarily to permit a direct 
comparison between the dimensions discovered in one group of individuals 
and those discovered in other groups. The data on which the comparisons 
are based are the dimensions of the various groups derived individually, 
by group, from a V-analysis. There are no requirements in comparative 
analysis that the different groups have the same number of dimensions. 
The dimensions to be compared across the groups may be termed 
the comparison entities, the columns to be compared. The row entities on 
which the dimensions are "observed" are the NV variables, termed the 
common referents of the dimensions. In general, there must always be 
a set of common referents in relation to which entities are observed if 
we are to compare them. The observations in the matrix of entities vs. 
dimensions are the oblique factor coefficients of the dimensions on the 
NV variables. The degree to which two dimensions are alike is measured 
by the index of similarity of their two columns of entities. The index of 
similarity is optional in the BC TRY component: the index of 
proportionality, the correlation coefficient, or, best, the cosine of the interangular 
separation, cos θ, of the comparison entities in the solution space. 

There are four main steps in a comparative analysis. The first step 
is to obtain the matrix of factor coefficients for each of the separate 
groups to be entered into the comparative analysis. This job is executed 
by the component program COMP1, which punches the oblique factor 
coefficients from the results of a cluster structure analysis, i.e., from CSA. 
When COMP1 is preceded by FALS and GYRO, it outputs the rotated 
orthogonal factor coefficients. The second step is to adjoin these factor 
coefficient matrices so that there are NV rows and $K_1 + K_2 + \cdots + K_G$ 
columns, where $K_i$ is the number of dimensions in the ith group, and 
where there are G groups. This is the first step of the component COMP2; 
the G separate decks of cards output from each of the G V-analyses by 
COMP1 are stacked together and entered in the 
data deck for the component COMP2. The third step is to delete selected 
columns and rows from this adjoined matrix, as indicated by the user in 
COMP2 control cards. The fourth step is to compute the similarity of each 
pair of the dimensions, within and between the separate groups. 
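
Steps two through four can be sketched with array operations. The sketch is illustrative and is not the COMP2 program: the function names are hypothetical, and the similarity computed is the proportionality index between columns (the sum of cross products divided by the root of the product of the sums of squares), chosen here only because its formula is standard.

```python
import numpy as np

def adjoin(factor_matrices):
    """Place the groups' factor matrices (each NV x K_i) side by side."""
    return np.hstack([np.asarray(m, dtype=float) for m in factor_matrices])

def drop(M, rows=(), cols=()):
    """Delete the listed row variables and column dimensions."""
    keep_r = [i for i in range(M.shape[0]) if i not in set(rows)]
    keep_c = [j for j in range(M.shape[1]) if j not in set(cols)]
    return M[np.ix_(keep_r, keep_c)]

def proportionality(M):
    """P between every pair of columns: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)
```

The resulting square matrix has one row and one column per retained dimension, across all groups, and can be handed on as if it were a correlation matrix.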

There has been considerable uncertainty about how to measure the 
similarity of two dimensions when the observations are not scores on 
individuals but are factor coefficients. What should be the index of 
similarity? General concurrence is that it should not be the correlation 
between their two columns of factor coefficients. Some writers have 
proposed P, the proportionality index, the square of which is our index of 
mutual collinearity used in key-cluster factoring (see Burt, 1948; Tucker, 
1951; Wrigley and Neuhaus, 1955). The answer, however, is quite 
straightforward. Clearly within a group, the formula for the index of similarity 
between two dimensions computed from factor coefficients should give 
the same value as the correlation between those dimensions (factors) 
already computed in CSA. We already have those values actually calculated 
in CSA as correlations between domains. Therefore, all we need is the 
formula by which these same values can be computed from the columns 
of oblique factor coefficients also listed in CSA and set up by COMP2 as 
columns adjoined to those of the other groups. 

Within a group, the correlation between two dimensions, the value 
of which has been computed in CSA from the original correlation matrix, 
turns out to be a simple quadratic function of the P value, the collinearity 
measure used in CC factoring but here computed from the two columns 
of oblique factor coefficients of the two dimensions. Since the correlation 
between two dimensions as given in CSA is the cosine of their central angle 
in SPAN, it is perforce an index of their collinearity; hence the index 
computed in COMP2 from P is called the "collinearity index" or simply 
cos θ, symbolized as L. 

The value of L between any two dimensions within a group can be 
directly compared with their interdomain correlation given in CSA for that 
group. These are in correspondence only if the definers of each dimension 
are themselves exactly collinear, usually only approximately the case. In 
sum, L is a general index of similarity between dimensions both within and 
across groups since there is no logical restriction that the dimensions be 
determined on the same group of individuals. 

A full-cycle V-analysis is finally executed on the similarity matrix. 
This matrix is accepted as an r matrix by the CC factoring component. 
Structure analysis by CSA and SPAN describes all the relations among the 
dimensions within and across groups. From the results one can thus draw 
conclusions about the similarities and differences among the dimensions in 
the various groups. 


The BC TRY component COMP1 


like the first part of COMP, is 
data cards for COMP2. COMP1 


er groups for input into COMP2. 
for the use of COMP2. If the 


of the data cards required by 
COMP2, or he may allow COMP1 to do the work. 


The BC TRY component COMP2 


COMP2 performs the comparative cluster or factor analysis. This program 
is designed to calculate a matrix of indexes of similarity among the adjoined 
sets of dimensions and to store this matrix on the intermediate storage 
tape for the use of other programs in the BC TRY System. 

COMP2 is quite general and can perform comparative analyses of 
column entities in any kind of matrix in different groups so long as, in all 
groups, the rows refer to common referents (variables or objects) and the 
entries in the matrix are relations between the row and column entities. 
COMP2 executes three major steps: (1) It inputs the COMP1 cards 
and adjoins the factor matrices of the different groups, thus forming one 
large rectangular observation matrix. (2) At the user's option, it deletes 
any column dimensions (comparison entities) in any groups that are not 
to be compared with the other dimensions, and it also deletes any row 
variables not wanted as common reference entities. (3) It calculates indexes 
of similarity among all column dimensions to be compared and stores them 
as a correlation matrix (in IST file CORRM1), ready for full-cycle key-cluster 
analysis. 

Following these steps of COMP2 it is, of course, the full-cycle 
key-cluster analysis of the similarity matrix (CORRM1) that reveals the degrees 
of identity of the dimensions of the different groups. 


Cluster or factor scores on individual objects 


The psychometric principles of forming composite scores on individuals 
have been known for a long time. These principles are basically grounded 
on domain sampling doctrine, namely, defining a domain of behavior, tak- 
ing a number of test samples of it, and forming a composite score on them, 
usually additively, as a better description of the attribute than one dealing 
with a single-test-sample score. In cluster or factor analysis, forming com- 
posites occurs at two stages: (1) Each of the NV specific attributes is 
usually a composite score itself consisting either of the sum of observed 
item samples or of an integrated judgment or rating composite of a com- 
plex multifaceted behavior domain. (2) After performing a V-analysis of 
NV such composites, one reduces NV by forming K supercomposites, 
each designed to measure a general attribute defined either by a cluster 
of S highly collinear variables or by all NV variables appropriately weighted 
to measure an independent or oblique dimension derived by factoring. 
When the composites are defined by clusters of variables, we refer 
to the scores as "cluster scores." When the composites are defined on all 
NV variables, we refer to the scores as "factor scores." The general 
formula for any composite may be written 

$$C_p = \sum_{j=1}^{NV} w_j X_{pj} \tag{12.47}$$

where $w_j$ is a weight associated with the jth variable. Each object is evaluated 
by this equation, with the value of $X_{pj}$ being the score of the pth object 
on the jth observed variable. How the weights are defined determines the 
type of composite obtained. It is assumed in this equation that the scores 
are standardized scores. 
Cluster scores as a simple sum 


The Tryon method of scoring clusters is simply to sum the scores of 
variables in each cluster. This corresponds in (12.47) to substituting 1.00 as 
the weight for variables in the cluster and .00 as the weight for variables 
not in the cluster: 

$$C_p = \sum_{j \,\in\, \text{cluster}} X_{pj} \tag{12.48}$$


When these cluster composites are rescaled to standard scores, the simple 
sum is transformed to a differentially weighted sum that depends on the 
intercorrelation of the variable with the variables in the cluster. A definer 
with higher average correlations among the other definers will generate 


more variance in the composite score than those with lower average 
intercorrelations do. 
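
A composite of this kind is a weighted sum of standardized scores, and the simple sum is the special case with weights of 1.00 and .00. The sketch below shows both as one matrix product; the function names and the objects-by-variables layout of `X` are assumptions made for illustration.

```python
import numpy as np

def simple_sum_weights(clusters, n_vars):
    """Weight matrix with 1.00 for variables in a cluster and .00 elsewhere."""
    W = np.zeros((len(clusters), n_vars))
    for i, members in enumerate(clusters):
        W[i, list(members)] = 1.0
    return W

def composites(X, W):
    """C[p, i] = sum over j of W[i, j] * X[p, j], for X of shape objects x variables."""
    return np.asarray(X, dtype=float) @ np.asarray(W, dtype=float).T
```

Replacing the 0/1 weight matrix with regression-derived weights changes the type of composite without changing the computation.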


Regression estimates as composites 


A rather large number of methods and techniques for obtaining the weights 
in (12.47) have been developed. The major procedures among these are 
described by Harman (1967) and will not be discussed here. Two of these 
methods are included in the BC TRY component FACS. 


The BC TRY components FACS and FACS3 


Two separate components are available in the BC TRY System to calculate 
composites. Both FACS and FACS3 calculate cluster composites, but 
FACS3 does not calculate regression estimate composites. The 
simple-sum, or Tryon, method is straightforward in its application of (12.48). Each 
of the score dimensions is standardized in accordance with parameters 
input on data cards or standardized to a mean of 50 and a standard 
deviation of 10 if no parameters are input by the user. When FACS3 is used with 
missing data, the mean observed score on the other variables in the 
cluster is inserted in the missing data cells and the cluster scores are 
standardized. 

The IST and DST input-output summaries are in Tables 12.1 and 12.2. 
Printed output of the components is the basic data defining the 
composites, the standard deviations of the composites, the effective weight 
matrices, the defining variables of clusters for the Tryon method, the input 
or regression-determined weight matrix, the intercorrelations between the 
composite scores, and the composite score matrix. The composite score 
matrix is punched out at the option of the user, giving the title of the job, 
the format in which the scores are punched (perhaps more than one card 
per object), and the matrix of scores punched in accordance with the 
format. 
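
The treatment of missing data and the default rescaling can be sketched as follows. This is an illustration rather than the FACS3 code: the function name is hypothetical, missing cells are assumed to be coded as NaN, every object is assumed to have at least one observed score in the cluster, and the standard deviation is taken over the objects in hand.

```python
import numpy as np

def cluster_scores(X, members, mean=50.0, sd=10.0):
    """Simple-sum cluster scores, with missing cells filled and the result rescaled."""
    sub = np.asarray(X, dtype=float)[:, list(members)]
    # Each object's mean observed score on the cluster's variables.
    row_mean = np.nanmean(sub, axis=1, keepdims=True)
    filled = np.where(np.isnan(sub), row_mean, sub)
    raw = filled.sum(axis=1)
    z = (raw - raw.mean()) / raw.std()
    return mean + sd * z
```

An object with scores 2 and 4 and one missing cell is scored as if the missing cell held 3, so it receives the same cluster score as an object scoring 3 on all three variables.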


O-analysis with EUCO 


Because the logic of EUCO for O-analysis was extensively discussed in 
Chap. 8, we do not describe it in very much detail here. Two primary 
options are available in the BC TRY component EUCO; the first compares 
each of the NS subjects with each other, and the second compares each 
of the NS subjects with a set of specially defined "subjects" called 
OMARKs. Each of the NS subjects is defined by a set of K scores. These 
scores compose a vector indicating the location of the subjects as points 
in a K-dimensional space. By applying the simple geometry of Euclid, the 
distance, or euclidean distance, of each point from the other points can be 
calculated. For the pth and the qth object the distance in the K-dimensional 
space is given by 

$$D_{pq} = \sqrt{\sum_{j=1}^{K} (C_{pj} - C_{qj})^2} \tag{12.49}$$

The C values in (12.49) are defined by (12.48). This matrix of distances 
is output to the DST as a raw score matrix, and the O-analysis proceeds 
as though it were a standard cluster analysis. In the second option the NS 
subjects are compared against specially defined points in the score space 
called OMARKs. Each OMARK is defined as a point having all but one 
dimension score equal to the mean value of the scores. A set of six OMARKs 
is defined to represent a given score dimension, the scores being equal to 
-3, -2, -1, +1, +2, and +3 standard deviations from the mean on the 
score dimension, respectively. Thus for K dimensions there are a total of 
6K OMARKs defined in this way. In addition, the average point is 
defined by an OMARK having average scores on all dimensions. Each of 
the NS objects is compared with each of the 6K + 1 OMARKs to 
give a score matrix of 6K + 1 by NS, with NS playing the role of the number 
of variables for the following O-analysis. The distance of each object from 
each of the 6K + 1 OMARKs is calculated by (12.49) for the pth object and the 
qth OMARK, where $C_{pj}$ is the score of the pth object on the jth score 
dimension and $C_{qj}$ is the score of the qth OMARK on the jth score dimension. 
This matrix of distances is stored on the DST as a raw score matrix with 
NS variables and 6K + 1 observations. The O-analysis proceeds as a standard 
V-analysis. 
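
Both options reduce to computing euclidean distances between rows of score matrices. The sketch below builds the 6K + 1 OMARKs and the two distance matrices; the function names are hypothetical, and the standard deviation is assumed to be taken over the NS objects in hand.

```python
import numpy as np

def omarks(C):
    """The 6K + 1 OMARK points for a score matrix C (objects x K dimensions)."""
    C = np.asarray(C, dtype=float)
    mu, sd = C.mean(axis=0), C.std(axis=0)
    marks = []
    for j in range(C.shape[1]):
        for step in (-3, -2, -1, 1, 2, 3):
            m = mu.copy()  # mean score on every dimension except j
            m[j] = mu[j] + step * sd[j]
            marks.append(m)
    marks.append(mu.copy())  # the average point
    return np.array(marks)

def distances(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))
```

`distances(C, C)` gives the NS by NS matrix of the first option, and `distances(omarks(C), C)` gives the (6K + 1) by NS matrix of the second, which stays small however many objects there are.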
O-analysis with OTYPE 


The procedures of the BC TRY component OTYPE were described in some 
detail in Chap. 8, and this section is therefore limited to the bare essentials 
of the statistical formulation of the component. The NS objects are 
represented in a K-dimensional space defined by the K cluster or factor scores 
from FACS. The K-dimensional space is divided into sectors by sectioning 
each of the K dimensions as specified by input parameters (either the 
number of sectors on a dimension or scores on the dimension specifying 
the sectors). Sectors in the K-dimensional space that have more objects 
falling within their boundaries than specified by a parameter established 
by the user (or set to a standard value by the component) define the first 
approximation to O-types. For each of these O-types, a mean score on 
each dimension is computed; these scores define a new "subject" that is 
the centroid of the trial core O-type. The distance (12.49) between each 
of these O-types is computed, and the closest O-types are "merged," 
their scores are averaged, and the merged O-type takes the place of the 
two O-types just merged. This process is repeated until all the O-types are 
merged into a single O-type. The distances of the successive O-types with 
the remaining O-types are printed in the process. The merging is strictly 
an arithmetic process, the actual O-type groups remaining distinct in the 
program. 
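
The sectoring and the arithmetic merging can be sketched as two small functions. This is an illustration under stated assumptions, not the OTYPE program: the names are hypothetical, sectors are defined by sorted cut points on each dimension, and merged centroids are averaged without weighting by membership.

```python
import numpy as np
from collections import defaultdict

def trial_otypes(C, cuts, min_count):
    """Centroids of the sectors that hold more than min_count objects."""
    C = np.asarray(C, dtype=float)
    cells = defaultdict(list)
    for p, row in enumerate(C):
        # The sector of an object is the tuple of its interval numbers on each dimension.
        key = tuple(int(np.searchsorted(cuts[i], row[i])) for i in range(C.shape[1]))
        cells[key].append(p)
    return {k: C[v].mean(axis=0) for k, v in cells.items() if len(v) > min_count}

def merge_sequence(centroids):
    """Repeatedly merge the two closest centroids; return the distance at each merge."""
    pts = [np.asarray(c, dtype=float) for c in centroids]
    merged_at = []
    while len(pts) > 1:
        d, a, b = min((float(np.linalg.norm(pts[a] - pts[b])), a, b)
                      for a in range(len(pts)) for b in range(a + 1, len(pts)))
        new = (pts[a] + pts[b]) / 2.0  # average the two sets of scores
        pts = [p for i, p in enumerate(pts) if i not in (a, b)] + [new]
        merged_at.append(d)
    return merged_at
```

The list of merge distances plays the role of the printed record: a large jump between successive values suggests that the O-types merged at that step were well separated.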


Each of the NS objects is compared with each of the trial core
O-types. The objects are assigned to the core O-type with which they have
smallest euclidean distance. This results in the exclusion of some objects
from a core O-type, because the objects are closer to another core O-type
in an adjacent sector, and the inclusion of some objects from adjacent
sectors, some of which may have been in other core O-types or in sectors
that were too sparsely filled to define a core O-type. An object is not
included in an O-type if it is at a distance greater than c times the RMS
standard deviations from the O-type centroid (see Chap. 8).

Each of the O-clusters now occupies a different locus than that of its
progenitor. The changes in the O-clusters are revealed by (1) the drift of
its locus as measured by the distance between its new locus and its old
locus, (2) the changes in mean values on each of the K dimensions, (3)
changes in the membership of the O-cluster. These data are printed by
the component. If two O-clusters are closer than specified by a user-
controlled lower bound, the two O-clusters are condensed, thus eliminating
one O-cluster from further consideration.

At this point in the process the O-clusters that remain are treated 
in the same way as the initial trial O-types, and the process is repeated. 
The process is reiterated until no O-type changes membership or until a
user-determined number of iterations has been performed.
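
The reassignment and iteration just described can be sketched as below. This is an illustrative outline under stated assumptions, not the OTYPE program itself: the function names are hypothetical, the distance cutoff is taken as a precomputed number `cutoff` (the text forms it from a constant c and the score variances), and the condensation of close O-clusters is omitted.

```python
import math


def assign(scores, centroids, cutoff):
    """Label each object with its nearest centroid, or None if too distant."""
    labels = []
    for row in scores:
        dists = [math.dist(row, c) for c in centroids]
        g = min(range(len(centroids)), key=dists.__getitem__)
        # An object joins an O-type only if its distance is below the cutoff.
        labels.append(g if dists[g] < cutoff else None)
    return labels


def iterate(scores, centroids, cutoff, max_iter=10):
    """Repeat assignment until membership is stable or max_iter is reached."""
    centroids = [list(c) for c in centroids]  # work on a copy
    labels = None
    for _ in range(max_iter):
        new_labels = assign(scores, centroids, cutoff)
        if new_labels == labels:
            break  # no O-type changed membership
        labels = new_labels
        for g in range(len(centroids)):
            members = [scores[j] for j, lab in enumerate(labels) if lab == g]
            if members:
                # New locus of the O-cluster: mean of its current members.
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Objects left with the label `None` correspond to those excluded from every O-type because they lie beyond the cutoff distance.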

O-cluster homogeneities are calculated and printed. These values 
are described in detail in the section below on OSTAT. The means of scores 
on the K dimensions of all O-clusters are punched at the option of the user. 

The formalisms of OTYPE are as follows. Let the matrix of cluster or
factor scores be $C_{ij}$ for the ith dimension and the jth object. Let the mean
cluster score within the gth O-cluster be denoted $\bar{C}_{ig}$ for the ith dimension.
The distance between the jth object and the gth O-cluster is defined by

$$L_{jg} = \sqrt{\sum_{i=1}^{K} (C_{ij} - \bar{C}_{ig})^2} \qquad (12.50)$$

Each object is reassigned to that O-cluster for which (12.50) is smallest,
contingent on the restriction that the value $L_{jg}$ is smaller than

$$c \sqrt{\sum_{i=1}^{K} \sigma_i^2} \qquad (12.51)$$

where c is a constant and $\sigma_i^2$ is the variance of the ith score dimension.
The mean dimension score of an O-cluster will change when the membership
of the O-cluster changes. Let $\bar{C}_{ig}^{(n-1)}$ be the mean of the gth O-cluster
on the ith dimension for the (n - 1)st iteration. On the nth iteration the
mean is $\bar{C}_{ig}^{(n)}$, and the difference in means is

$$\Delta_n \bar{C}_{ig} = \bar{C}_{ig}^{(n)} - \bar{C}_{ig}^{(n-1)} \qquad (12.52)$$

The drift can be defined in terms of the total euclidean distance covered by
the O-cluster centroid:

$$\sqrt{\sum_{i=1}^{K} (\Delta_n \bar{C}_{ig})^2} \qquad (12.53)$$

The intercluster distances are computed from the centroids of the
O-clusters:

$$L_{gh} = \sqrt{\sum_{i=1}^{K} (\bar{C}_{ig} - \bar{C}_{ih})^2} \qquad (12.54)$$

$L_{gh}$ is computed at each iteration, and the two O-clusters with the smallest
L value are merged if their distance is less than a quantity formed like
(12.51) with the appropriate constant for cluster condensation. The new
mean of the merged cluster is calculated by

$$\frac{S_g \bar{C}_{ig} + S_h \bar{C}_{ih}}{S_g + S_h} \qquad (12.55)$$

where the S's are the number of objects in the respective clusters. The
intercluster distances are recalculated for the merged cluster and the
condensation procedure is repeated until no distance is smaller than the
condensation criterion.
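
The condensation procedure can be sketched as follows. This is a hedged illustration only: the function name `condense` and the single numeric `criterion` are assumptions for the example, and the sketch works on centroids and cluster sizes alone, using the size-weighted mean of (12.55) for each merger.

```python
import math


def condense(centroids, sizes, criterion):
    """Merge the closest pair of O-cluster centroids until none is closer
    than `criterion`. Returns the surviving centroids and their sizes."""
    centroids = [list(c) for c in centroids]
    sizes = list(sizes)
    while len(centroids) > 1:
        # Find the pair of O-clusters with the smallest intercluster distance.
        best = None
        for g in range(len(centroids)):
            for h in range(g + 1, len(centroids)):
                d = math.dist(centroids[g], centroids[h])
                if best is None or d < best[0]:
                    best = (d, g, h)
        d, g, h = best
        if d >= criterion:
            break  # no distance is smaller than the condensation criterion
        total = sizes[g] + sizes[h]
        # Size-weighted mean of the two centroids, as in (12.55).
        centroids[g] = [(sizes[g] * a + sizes[h] * b) / total
                        for a, b in zip(centroids[g], centroids[h])]
        sizes[g] = total
        del centroids[h]
        del sizes[h]
    return centroids, sizes
```

Weighting by cluster size keeps the merged centroid equal to the mean of all member objects, which a plain average of the two centroids would not.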

The component program of BC TRY, OSTAT, is used to calculate a
variety of statistics on the O-clusters from the dimension scores used to
determine the O-clusters. OSTAT lists the K cluster or factor scores of the
members of each of the O-clusters produced on the last iteration of
OTYPE; for each O-type OSTAT computes the means, standard deviations,
and homogeneities of scores on each dimension, correlation ratios $\eta$ for
each dimension on the discontinuous series of O-clusters, and the overall
homogeneities of each O-type across all dimensions. Two values of $\eta$ are
calculated, one for the actual scores of objects in O-clusters using only the
objects in O-clusters, and a Tryon $\eta$, which uses the variance of scores in
the entire sample of objects whether or not they fall in O-clusters. The K
cluster or factor scores are now differentiated according to the O-type in
which a subject is located. Letting the third subscript designate the O-type,
we have for the ith dimension and the jth subject in the gth O-type the
cluster score $C_{ijg}$. Let M stand for the number of O-types. The O-type
means are simply

$$\bar{C}_{ig} = \frac{1}{S_g} \sum_{j=1}^{S_g} C_{ijg} \qquad (12.56)$$

where there are $S_g$ objects in the gth O-type. The variance within an O-type
for a given cluster score dimension is given by

$$\sigma_{ig}^2 = \frac{1}{S_g} \sum_{j=1}^{S_g} C_{ijg}^2 - \bar{C}_{ig}^2 \qquad (12.57)$$

The variances of the cluster scores initially, for the entire sample of objects,
are $\sigma_i^2$. The squared homogeneity for the gth O-cluster
on the ith cluster score dimension is

$$H_{ig}^2 = 1 - \frac{\sigma_{ig}^2}{\sigma_i^2} \qquad (12.58)$$

The square root of the quantity given in (12.58) is the homogeneity of the
O-type on the dimension when the quantity in (12.58) is positive. When the
quantity (12.58) is negative, the homogeneity is defined as the negative
of the square root of the absolute value of (12.58). The overall squared
homogeneity of an O-cluster, across all K cluster dimensions, is given by

$$H_g^2 = 1 - \frac{\sum_{i=1}^{K} \sigma_{ig}^2}{\sum_{i=1}^{K} \sigma_i^2} \qquad (12.59)$$

The homogeneity is the square root of this quantity, with the same provision
for negative values as above. The Tryon $\eta$ is given by the square root of

$$\frac{\sum_{g=1}^{M} S_g H_{ig}^2}{\sum_{g=1}^{M} S_g} \qquad (12.60)$$

The standard $\eta$ is given by the square root of

$$1 - \frac{\sum_{g=1}^{M} S_g \sigma_{ig}^2}{\sum_{g=1}^{M} \sum_{j=1}^{S_g} (C_{ijg} - \bar{C}_i)^2} \qquad (12.61)$$

where

$$\bar{C}_i = \frac{\sum_{g=1}^{M} \sum_{j=1}^{S_g} C_{ijg}}{\sum_{g=1}^{M} S_g} \qquad (12.62)$$
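
The homogeneity and correlation-ratio statistics of OSTAT can be illustrated for a single score dimension as follows. This is a sketch under assumptions, not the OSTAT code: the function names are hypothetical, each O-cluster is given simply as a list of scores on one dimension, and variances use the divisor $S_g$ (population form).

```python
import math


def variance(xs):
    """Population variance (divisor len(xs))."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def signed_root(x):
    """Square root of |x| carrying the sign of x, the provision the text
    makes for negative squared homogeneities."""
    return math.copysign(math.sqrt(abs(x)), x)


def squared_homogeneity(cluster, total_variance):
    """1 minus the ratio of within-cluster to whole-sample variance."""
    return 1 - variance(cluster) / total_variance


def tryon_eta_squared(clusters, total_variance):
    """Size-weighted mean of the squared homogeneities."""
    num = sum(len(c) * squared_homogeneity(c, total_variance) for c in clusters)
    return num / sum(len(c) for c in clusters)


def standard_eta_squared(clusters):
    """Correlation ratio using only the objects that fall in O-clusters."""
    pooled = [x for c in clusters for x in c]
    total_ss = variance(pooled) * len(pooled)
    within_ss = sum(variance(c) * len(c) for c in clusters)
    return 1 - within_ss / total_ss
```

When every object in the sample belongs to some O-cluster, the whole-sample variance equals the pooled variance of the clustered objects and the two squared correlation ratios coincide; they differ only when some objects fall outside all O-clusters.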


Typological prediction 


The BC TRY component 4CAST applies the basic principles of computer
simulation or Monte Carlo estimation to the problem of determining the
degree to which a dependent variable is differentially distributed within
the O-clusters. 4CAST computes the mean, standard deviation, and homogeneity
of scores on a variable, generally a dependent variable in the data
set on the DST or on a dependent cluster score in the factor score file
on IST. The full set of NS scores on the dependent variable, the predicted
variable, are brought into memory of the computer, and the equations of
OSTAT pertinent to a single variable (K = 1) are applied to the variable
for the O-clusters from OTYPE. In addition, for each O-type, many random
samples of $S_g$ scores from the full set of NS scores are selected, and the
mean, standard deviation, and homogeneity of each sample are computed.
These statistics on the random samples are tallied against the empirical
statistics. The relative frequency with which the random samples give
statistics more extreme than the empirical statistics is printed out. The
distribution of sample means, standard deviations, and homogeneities is
printed in histogram form.
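
The Monte Carlo comparison can be sketched for the mean alone as follows. This is an illustration under assumptions, not the 4CAST program: the function name, the choice to sample without replacement, the use of Python's standard random number generator, and the definition of "more extreme" as a larger absolute deviation from the mean of the full set are all choices made for the example.

```python
import random
import statistics


def monte_carlo_mean_test(all_scores, cluster_scores, n_samples=1000, seed=0):
    """Relative frequency with which random samples of the O-type's size
    have a mean more extreme than the O-type's own mean."""
    rng = random.Random(seed)
    grand_mean = statistics.fmean(all_scores)
    observed = abs(statistics.fmean(cluster_scores) - grand_mean)
    size = len(cluster_scores)
    extreme = 0
    for _ in range(n_samples):
        # Draw S_g scores at random from the full set (without replacement).
        sample = rng.sample(all_scores, size)
        if abs(statistics.fmean(sample) - grand_mean) > observed:
            extreme += 1
    return extreme / n_samples
```

A small returned frequency indicates that the O-type's mean on the dependent variable is unlikely to arise from a random subset of the same size. The same tally could be kept for the standard deviation and the homogeneity.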

The workings of the program are straightforward, parallel with the
prose algorithm and the statistical procedures of OSTAT. However, the
random sampling procedures require comment. If a very large number of
random samples is drawn, the computer time involved can be considerable.
It is therefore important to take care in constructing the routine that
draws the random samples. The method of selecting the scores should be
efficient, and the random number generator should have certain statistical
properties. The reader is referred to Lehman and Bailey (1968) for an
introductory discussion of these issues and references to more advanced
discussions.


Miscellaneous BC TRY components and procedures 


The BC TRY System contains a number of component programs or specially
named procedures performed by component programs. This section
reviews these components or procedures in brief. The intent is to give a
thumbnail sketch of the peripherally important components of the system
but not a detailed statistical description of the components. Many of these
components are simply housekeeping programs to assist in the management
of data, or they are programs to perform standard statistical calculations,
e.g., DAP to input data and COR2 to calculate a correlation matrix.


Access 


Since BC TRY is a very large system, it must be maintained in the library
of the computer center, available to users through submission of job decks
containing only cards that make a request to the computer for the use of
the System and control and data cards for the system itself. The mechanism
to establish accessibility of the System on a computer varies with
the computer installation. In the next chapter this is referred to as part
of the local conventions of a job deck makeup.


General executive program, GEP
The GEP was described in some detail in Chap. 3. Some of the procedures
described below are actually a part of the GEP.


Binary tape dump, BTD
If, for any reason, the user of the System becomes confused about what
he has on the IST, he can obtain a "picture" of the contents of the IST by
including the EC card calling for a binary tape dump, /BTD. This procedure
is particularly useful in determining the status of the IST after it is defined
by a TAKE deck.


COMMENT 


The user can have as much explanatory material printed in his output as 
he wishes by including EC cards (at places where EC cards are expected 
by the System) with /COMMENT followed by cards containing the material 
he wishes to have printed. The material is printed, line by line, as it is 
encountered by the GEP in its attempt to find an EC card after the /COM- 
MENT card. 


Correlation, COR2 and COR3
These two components simply calculate the Pearson product moment 
correlation matrix for the data input through DAP. COR2 assumes that 
all the data are defined. COR3 assumes that there are missing data and 
calculates the correlation coefficients on objects for whom data are mutu- 
ally defined. A variety of other statistics are output by COR3, including the 
matched sample sizes, covariances, standard deviations, etc. 
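
The idea behind the missing-data correlation can be illustrated for one pair of variables. This is a sketch of pairwise (mutually defined) correlation in general, not the COR3 code: the function name is hypothetical, missing scores are represented by `None`, and population (divisor n) forms are assumed for the covariance and standard deviations.

```python
import math


def pairwise_correlation(x, y):
    """Correlate two variables over the objects for whom both are defined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)  # matched sample size
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs) / n
    sd_x = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs) / n)
    sd_y = math.sqrt(sum((b - my) ** 2 for _, b in pairs) / n)
    return {"n": n, "cov": cov, "sd_x": sd_x, "sd_y": sd_y,
            "r": cov / (sd_x * sd_y)}
```

Because each coefficient rests on its own matched sample, the matched sample size is worth reporting alongside every correlation, as COR3 does.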


Data processor, DAP 
The data processor program simply inputs to the DST the basic raw data 
of a study. DAP permits one to identify each variable by name and each 
object by name if so desired. DAP is capable of reflecting variables, of 
reordering variables, and of accommodating to missing data. Means and 
standard deviations of variables are calculated and output. 


Data printing, DPRINT
This EC card causes all the data on the DST to be printed, clearly labeled
as to variable and observation name and sequence number. Problems
caused by faulty input, either because of keypunch errors, faulty format
statements, or computer failure, are spotted easily by scanning this output.


Least squares factoring, FALS
Whereas we described the processes and statistical procedures of cluster
analysis in some detail, we refer the interested reader to other sources,
e.g., Harman, for a discussion of least squares factor analysis. The BC
TRY component FALS computes factors by three different methods: the
principal-axes method, the augmented factor analysis method, and the
canonical factor analysis method.


Factor statistics, FAST 


The factoring program FALS does not print out the residual matrix, and 
the program FAST is included in the System to do this. FAST also can print 
out the successive residual matrices or the successive reproduced matrices. 
The IST is modified by FAST so that the reproduced or the residual matrix 
takes the place of the correlation matrix in file CORRMI. This permits a 
cluster or factor analysis of the residual matrix. 


Input and output to the IST, GIST 


The component GIST (Generate IST) enables users of the System to input 
to or output from (on punched cards) the IST. This component permits 
recovery of data in keypunch form from a BC TRY run. It also permits the 
user to set up nonstandard uses of the System by defining IST files in 
special ways, independently of the use of certain BC TRY components.


Restart, GIVE-TAKE 


The components GIVE and TAKE permit the user to capture the IST at
any point in a run where an EC card is expected by GEP. If at a later date
the user wishes to restart the analysis at that point, he uses the deck of
cards containing the IST files to reestablish the IST precisely as it was
defined when he used the GIVE EC card.


Missing data statistics, RLIST 


The missing data correlation program COR3 can output all the statistics
describing the matched samples, etc. However, these are output one
matrix at a time. RLIST outputs the statistics for each cell of all the matrices
appearing together in the printout.


Correlation matrix organization, REDE 


For purposes of publication, a BC TRY user may want to reorganize a
correlation matrix. In order to reorder the variables, delete variables, and
reflect variables in a printed matrix the BC TRY user would use the
component REDE. This component does nothing but print out the reorganized
matrix.

Scatter diagrams, RSCAT 
The scatter diagram of the joint distribution of two variables is printed out 
by the component RSCAT. Also included in the output are the frequency 
distribution of each of the variables, the Pearson product moment
correlation coefficient, and the coefficient of nonlinear correlation $\eta$.


Temporary suppression of variables, 
SLEP1 and SLEP2 


There are two principal reasons why an investigator would want to suppress
variables in a cluster or factor analysis until after factoring is completed:
(1) To suppress experimentally dependent variables, i.e., those
which include other of the NV variables as physical components; such
variables are likely to generate communalities over 1.00 in factoring.
(2) To implement an experimental design in which some variables are
suppressed from factoring in order to see the portion of their variance
accounted for by the dimensions factored from the nonsuppressed variables.
SLEP1 is the component that suppresses the designated variables.
SLEP2 completely reactivates the suppressed variables after factoring is
completed and recomputes factor statistics on all NV variables in the
customary summary form.


Matrix-algebraic operations, SMIS 


This component performs general matrix-algebraic operations and pro- 
vides the user with almost unlimited capacity for matrix and vector opera- 
tions with access to the IST. Each operation defined under SMIS is initiated 
by the user by control cards. By combining the operations provided in 
SMIS the user can perform virtually any calculation desired, even though 
the calculation is not an integral part of the other components of BC TRY. 
Hence, with creative use of SMIS pioneering work can be done in
multivariate analysis within the setting of BC TRY.


Chapter 13 


ABRIDGED USER'S MANUAL OF THE BC TRY SYSTEM¹


This chapter is an abridgement of the User's Manual of the BC TRY
System. The unabridged manual, which contains detailed descriptions
of all control options in the System and extensive discussions of the purpose
and methods of the components of the System, makes up approximately
200 pages of computer-printed material. Most of the general discussion
in the Manual is covered in this book. In most applications of BC TRY the
control options are standard. In standard options the control cards of
BC TRY are generally blank or have stereotyped punches. As a consequence,
for most applications the card decks required are quite simple.
This chapter presents those segments of the Manual needed for standard
use of the System.


General conventions 


Three general kinds of cards are required in each BC TRY run: local
computer center cards, executive control cards, and component control
cards (including data cards). Because of the wide variety of requirements
imposed by local computer centers the local computer center cards are
not described in detail here. Although there may be different requirements
in various installations, there are generally three types of local cards: job
cards, monitor control cards to access BC TRY in the computer center
library, and end of job cards to signal the end of the BC TRY run. Wherever
local cards are likely to be required, they are indicated in this chapter by
the phrase "local cards" or simply "local."

¹ This chapter was written with the collaboration of John Bauer and Jin-Yu Yen.


TABLE 13.1 STANDARD SEQUENCES AND DECKS BY TYPES OF ANALYSES(a)

V-analysis

Empirical CC5          Principal axes-varimax    Preset CC5
Program    ID          Program    ID             Program    ID
Local      a           Local      a              Local      a
START      b:1         START      b:1            START      b:1
COMMENT    c:2+        COMMENT    c:2+           COMMENT    c:2+
DAP2       d:5+        DAP2       d:5+           TAKE       d:deck
DPRINT     e:1         DPRINT     e:1            DVP41      e:2
COR2(b)    f:1         COR2(b)    f:1            CC5        f:3
DVP41      g:2         DVP41      g:2            CSA2       g:3
CC5        h:3         FALS       h:5            SPAN2      h:2
CSA2       i:3         FAST       i:2            GIVE       i:1
SPAN2      j:2         GYRO       j:4            COMP1      j:1
GIVE       k:1         GIST       k:3            END        k:1
END        l:1         SPAN2      l:2            Local      l:1
Local      m:1         GIVE       m:1
                       END        n:1
                       Local      o:1
                       Then Preset CC5

O-analysis

Condensation Method    Convergence Method        Typological Prediction
Program    ID          Program    ID             Program    ID
Local      a           Local      a              Local      a
START      b:1         START      b:1            START      b:1
COMMENT    c:2+        COMMENT    c:2+           COMMENT    c:2+
DAP2       d:5+        DAP2       d:6+           DAP2       d:5+
TAKE       e:deck      DPRINT     e:1            TAKE       e:deck
FACS       f:4         GIST       f:9+           4CAST      f:2+
...AT      (opt)       EUCO2      g:2            END        g:1
OTYPE      g:2         COR2       h:1            Local      h:1
OSTAT      h:2         DVP71      i:2
GIVE       i:1         CC5        j:3
END        j:1         CSA2       k:3
Local      k:1         SPAN2      l:2
                       GIVE       m:1
                       END        n:1
                       Local      o:1

(a) Before the colons are the letter identifications of the cards (following
the sequential lettering in the text); after the colons are the minimal numbers
of cards required. This table can be used to check the card and sequence
numbers of a standard deck.
(b) COR3 is used if there are missing data.
(c) Sequence ...

Instructions for punching cards are stated in detail, or the columns
in which literal information is to be punched are indicated along with the
literal information in quotation marks. For example, on the first executive
control card of a BC TRY job the literal "/START" is punched in columns 1
through 6. This is indicated in the following by

Executive card

1-6 "/START"


which means "punch a card with the characters /START in columns 1
through 6." If other columns are to be punched, they are indicated in a
like manner. The phrase "pack right" in the instructions below means simply
to punch the indicated number at the right of the field indicated. The value 41
punched in columns 5 through 8 appears in columns 7 and 8 when packed
right.
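
The column conventions above can be made concrete with a small sketch that builds 80-column card images as strings. The helper names `pack_right` and `punch` are hypothetical and belong to this illustration only, not to BC TRY.

```python
def pack_right(value, first_col, last_col):
    """Right-justify a number within a field of card columns (1-indexed)."""
    width = last_col - first_col + 1
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the field")
    return text.rjust(width)


def punch(card, first_col, text):
    """Return a copy of an 80-column card image with text starting at first_col."""
    start = first_col - 1
    if start < 0 or start + len(text) > len(card):
        raise ValueError("text runs off the card")
    return card[:start] + text + card[start + len(text):]


blank = " " * 80
start_card = punch(blank, 1, "/START")              # columns 1-6
param_card = punch(blank, 5, pack_right(41, 5, 8))  # 41 lands in columns 7-8
```

Here `param_card` has blanks in columns 5 and 6 and the digits 4 and 1 in columns 7 and 8, matching the example in the text.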

Table 13.1 gives six standard sequences of BC TRY executive and
component cards. These sequences cover the most common applications
of BC TRY.


Empirical key-cluster analysis


a. Local cards.
b. START card, activating the BC TRY System. One card.
   1) Executive card.
      1-6 "/START"
c. COMMENT cards, for comments about the job. Optional. Two or
   more cards.
   1) Executive card.
      1-8 "/COMMENT"
   2) Component cards.
      1-72 Punch messages on as many such cards as desired.
d. DAP2 program: prepares data, reads and error-checks raw scores for
   input to DST (Data Storage Tape). Five or more cards plus data deck.
   1) Executive card.
      1-5 "/DAP2"
   2) Component cards.
      Card A. Title card.
         1-72 Title of problem to be printed on all output.
      Card B. Parameters and controls. Pack right.
         1-4 Number of variables, NV.
         5-8 Number of subjects, NS.
         9-12 Length of field (LFIELD), i.e., number of columns used per
         variable on raw-score data cards. If LFIELD exceeds 9, or is
         not the same for all variables, or if data are in decimal format,
         or if NV is only part of the total number of variables on data
         cards, or if ONAMS are not on columns 73 to 78, LFIELD
         must be left blank and card C (format card) below must be
         prepared.
         13-16 A "1" punch means data are complete; a blank means missing
         data.
      Card C. Format card, necessary only if columns 9-12 of card B
         above are blank. For details of preparing card C, see section
         headed "Format statement in DAP2 and GIST" below.
      Card D. Variable names, called V-names, providing for the printing
         of user-assigned names of variables on output. The program
         automatically assigns V-names if user does not.
         Card 1. Directs program to input V-names.
            1-14 "VARIABLE NAMES"
         Card 2. Detail cards of V-names.
            1-72 Punch variable names in six or less alphameric characters,
            separate each name by a comma, and punch a
            period after the last name. Use as many cards as necessary.
            Do not split a name between cards.
      Cards E, F, G. For optional reordering or reflecting variables and
         for supplying O-names.
      Data cards. The deck of raw scores comes last, preceded and followed
         respectively by control card 1 and control card 2.
         Card 1. Directs input of data to DAP2.
            1-4 "DATA"
         Data Deck is inserted here. It is prepared as follows. The
         NV raw scores of each individual are punched in fields of
         constant length, LFIELD (number of columns). LFIELD
         cannot exceed nine. A missing score is left blank in its field.
         The O-names are punched in columns 73-78 on each data
         card of each individual. A score must not be split between
         two cards. The sequence number of a given individual's
         cards is punched in columns 79 and 80, packed right.
         Card 2. Signals end of data input.
            73-75 "END"
e. DPRINT program, printing out the raw scores as written on DST by
   the DAP2 program. Optional, but desirable as a check to be sure data
   are correctly input to the computer. One card.
   1) Executive card.
      1-7 "/DPRINT"
f. CORRELATION (COR2) program for complete data: computes correlations
   among the variables and other statistical constants of them.
   One card.
   1) Executive card.
      1-5 "/COR2"
   CORRELATION (COR3) program for missing data. Two cards. See
   below in section headed "Optional programs in V-analysis" for controls.
g. DVP program: estimates communalities for input to the diagonal cells
   of the correlation matrix. (It can also insert unities.) Two cards.
   1) Executive card.
      1-4 "/DVP"
   2) Component card.
      3-4 "41"
h. CC5 program: determines K, the number of dimensions sufficient to
   reproduce the correlation matrix; uses diagonal values, reiterates and
   computes residuals; defines dimensions by key clusters (subsets) of
   variables. Three cards.
   1) Executive card.
      1-4 "/CC5"
   2) Component cards.
      Card A. Blank.
      Card B. Blank.
i. CSA2 program: computes factor coefficients on oblique cluster
   domains (factors), interdomain (common factor) correlations, reliability
   of cluster scores (accuracy of factor estimates), and other
   aspects of cluster or factor structure. Three cards.
   1) Executive card.
      1-5 "/CSA2"
   2) Component cards.
      Card A. Blank.
      Card B. Blank.
j. SPAN2 program: based on the results of factoring, the program
   allocates the variables to a minimal set of subspaces and plots the
   variables as points on the surface of a minimal set of salient spheres,
   physically rotated for visually centering the configuration. Two cards.
   1) Executive card.
      1-6 "/SPAN2"
   2) Component card.
      Card A. Blank.
k. GIVE restart card, producing an output deck of binary cards that
   contain IST files produced by the above programs. One card.
   1) Executive card.
      1-5 "/GIVE"
   The printout from GIVE gives the title of an output TAKE card and
   a list of the IST files punched on the output binary GIVE deck. These
   are:

      VNAMSI (names of variables)
      VSUMSI (sums of variables)
      MEANSI (means of variables)
      ONAMSI (names of individuals)
      IPFILE (master identification)
      IDFILE (title)
      STDEVI (standard deviations of variables)
      CORRMI (correlation matrix)
      CLUSTI (cluster definers, i.e., indices)
      REFLXI (cluster definers with signs)
      DIAGVI (diagonal values)
      UFACTI (orthogonal factor coefficients)
      RFACTI (rotated factor coefficients)
      BASISI (correlations between cluster domains, factors)

   Along with this deck are various local cards, which must be present in
   the restart of the job through the TAKE component. These local cards
   vary from installation to installation.
l. END of job. One card.
   1) Executive card.
      1-4 "/END"
m. Local cards.
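
The rules for the V-name detail cards (card D of DAP2) lend themselves to a short sketch: names of six or fewer alphameric characters, separated by commas, a period after the last name, columns 1-72 only, and no name split between cards. The function name `vname_cards` is hypothetical and the routine is only an illustration of those rules, not part of BC TRY.

```python
def vname_cards(names, width=72):
    """Pack variable names onto as many 72-column detail cards as needed."""
    if not names:
        raise ValueError("at least one variable name is required")
    for name in names:
        if not (1 <= len(name) <= 6 and name.isalnum()):
            raise ValueError("bad V-name: %r" % name)
    # Every name is followed by a comma except the last, which takes a period.
    tokens = [n + "," for n in names[:-1]] + [names[-1] + "."]
    cards, current = [], ""
    for token in tokens:
        if len(current) + len(token) > width:
            cards.append(current)  # start a new card rather than split a name
            current = ""
        current += token
    cards.append(current)
    return cards
```

For example, thirty names V1 through V30 need two cards: the first ends with "V20," and the second ends with "V30.".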


Preset key-cluster analysis 


After a study of the output of the empirical key-cluster solution, one usually
wishes to revise the defining variables of one or more of the dimensions,
perhaps even to eliminate some dimensions entirely. This is done by
restarting the analysis with the TAKE deck that was output by the GIVE
restart card.

The revised run is identical with the empirical run except that (1) the
segment that includes DAP2, DPRINT, and COR2 (or COR3) is now
replaced by the TAKE deck input and (2) the preset features of the CC5
program are now utilized. Here is the full sequence:


a to c: Local, START, COMMENT.
d. TAKE program: the Intermediate Storage Tape, IST, is restored with
   files that are needed by the CC5 program, namely, IDFILE, CORRMI,
   and DIAGVI. A TAKE deck is composed as follows.
   1) Executive card.
      1-5 "/TAKE" This card is included in the Hollerith (BCD) part of
      the punch output from GIVE.
   2) Binary data cards.
      These cards of the GIVE deck contain the data to be restored on
      the IST.
e. DVP program: same as the empirical run. Two cards.
f. CC5 program: same as the empirical run, except for punching several
   parameter values on component card A, and detail cards informing
   the program of the revised defining variables of the dimensions. Three
   cards plus deck of detail cards C.
   1) Executive card.
      1-4 "/CC5"
   2) Component cards.
      Card A. Parameters. Pack right.
         1-4 "4" informing program that the number of dimensions is
         preset in columns 17-20 below.
         9-12 "3" informing program that the defining variables of each
         dimension are provided in detail card C below. If only the number
         of dimensions is to be preset, leave blank.
         17-20 Punch K, the number of dimensions in this preset run.
      Card B. Blank.
      Card C. Detail cards specifying definers of preset dimensions.
         Punch a detail card for each of the K dimensions, giving the ordinal
         numbers of not more than 20 definers of the dimension in integer
         card format (successive four column fields, beginning in column 1
         and ending in column 80; use no punctuation marks; pack numbers
         to the right in each field), without reflection signs (indicators)
         attached to the definers. Skip this card if only the number of
         dimensions is to be preset.
g. CSA2 program: same as empirical run. Three cards.
h. SPAN2 program: same as empirical run. Two cards.
i. GIVE restart card: same as empirical run. When the GIVE deck is
   received from this preset run, destroy the GIVE deck received from
   the empirical run. One card.
j. Optional COMP1 program, for a later comparison of the dimensions
   found here with those discovered in other groups, as described in
   section 3 in "Optional programs in V-analysis" for program COMP2.
   One card.
   1) Executive card.
      1-6 "/COMP1"
k. END of job: same as empirical run.
l. Local cards: same as empirical run.
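
The integer card format of detail card C for CC5 (step f above) can be illustrated with a brief sketch. The function name `definer_card` is hypothetical; the sketch only applies the stated layout of successive four-column fields, numbers packed right, at most 20 definers, and no punctuation.

```python
def definer_card(definers, field_width=4, max_fields=20):
    """Build one 80-column detail card listing the definers of a dimension."""
    if not 1 <= len(definers) <= max_fields:
        raise ValueError("a dimension takes between 1 and 20 definers")
    fields = []
    for d in definers:
        if d < 1 or len(str(d)) > field_width:
            raise ValueError("bad ordinal number: %r" % d)
        fields.append(str(d).rjust(field_width))  # pack right in each field
    return "".join(fields).ljust(field_width * max_fields)


card = definer_card([3, 12, 7])
```

The resulting card carries variable 3 in columns 1-4, variable 12 in columns 5-8, and variable 7 in columns 9-12, with the remaining columns blank.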


Optional programs in V-analysis 


1. Correlation program for missing data, COR3 (in place of COR2):
   computes correlations and covariances among the variables and other
   statistical constants. Two cards.
   1) Executive card.
      1-5 "/COR3"
   2) Component card.
      Punching a "1" in each of the columns 4, 8, 16, 20, 40, will give the
      following: correlation matrix, standard deviations, matched N's.

2. Correlation scattergram program, RSCAT: plots the correlation scatter
   between any or all of the NV variables or K cluster scores, computes
   for each scatter the value of r, both $\eta$'s, the regression constants,
   frequency tables, and other statistical values. Two cards, unless
   selected scattergrams are specified, which requires a detail card.
   1) Executive card.
      1-6 "/RSCAT"
   2) Component card: one or more cards.
      Card A.
         1-4 Blank. Compute the scattergrams between all the variables or
         clusters. (NV or K not greater than 18.)
         "2" Compute the scattergrams between all variables or clusters
         and a common variable or cluster. Designate the ordinal number
         of the common variable/cluster in columns 17-20.
         "3" Compute only scattergrams specified on card B.
         5-8 Blank. Scattergram between clusters. FACS required before this
         program.
         "2" Scattergram between variables. DAP2 required before this
         program.
         17-20 Necessary only when punched "2" in column 4. Punch the
         ordinal number of common variable or cluster.
      Card B. Necessary only when punched "3" on column 4, card A.
         Punch "1+3, 2+4, 10+20." for the combinations of 1 and 3, 2 and 4,
         10 and 20. Only 24 scattergrams can be specified.
3. Comparative cluster analysis of the dimensions found in different
   groups: performs a comparative analysis by COMP2 of the key-cluster
   dimensions discovered in separate groups of individuals that are
   measured on the same variables. COMP1 prepares the decks to be
   input to COMP2. COMP2 adjoins the input factor matrixes of the
   separate groups, computes similarity indexes between all dimensions
   in the form of a correlation matrix on which a full cycle empirical
   key-cluster analysis is then performed. Two cards plus the COMP1 data
   cards following cards a through c of an empirical key-cluster analysis.
   1) Executive card.
      1-6 "/COMP2"
   2) Component cards.
      Card A. Title card.
         1-72 Punch title card describing the analysis.
      Card B.
         1-4 Punch the number of the groups to be compared.
      Data cards from COMP1 that were output during the preset key-cluster
         analysis, step j: these COMP1 cards from each preset
         run should be interpreted, then stacked together and inserted after
         card B. Use the whole COMP1 punch output as data cards. Do not
         remove any cards.
   3) Full cycle CC5 analysis: follow the cards for the COMP2 program
      by exactly the same control cards of steps g to m of the empirical
      key-cluster solution, namely, by DVP, CC5, CSA2, SPAN2, GIVE,
      END, and local.

4. Principal axes, varimax analysis: computes an orthodox principal-axes
   solution with varimax or quartimax rotation, from which a cluster
   structure analysis can be derived, if desired. Required input files: the
   programs that must precede the principal-axes program are identical
   with those that must precede the CC5 program, namely, those from
   steps a through g of an empirical key-cluster analysis. Here is the full
   sequence:

   a through g: local, START, COMMENT, DAP2, DPRINT, COR2, DVP41.
      Same as above.
   h. FALS, using the principal factor analysis option. Five cards.
      1) Executive card.
         1-5 "/FALS"
      2) Component cards.
         Card A. Specifies the type of least squares solution desired.
            1-3 "PFA"
         Card B. Blank.
         Card C. Blank, but if one wishes to preset the dimensionality,
            then in columns 1-4 punch the number of preset
            dimensions desired. (Pack right.)
         Card D. Blank.
   i. FAST: computes the matrix of residuals after the last principal
      axis dimension. Two cards.
      1) Executive card.
         1-5 "/FAST"
      2) Component card.
         1-8 "RESIDUAL"
   j. GYRO program: computes the varimax rotation of the principal
      axes calculated above in FALS. Four cards.
      1) Executive card.
         1-5 "/GYRO"
      2) Component cards.
         Card A. Specifies method of rotation.
            1-6 "VARMAX" (for a quartimax rotation, punch "QRTMAX").
         Card B. Blank.
         Card C. Blank.
   k. GIST program for the output of rotated varimax factor coefficients
      on a deck of cards (a GIST "output package"). Reason: the factor
      coefficients of some dimensions may be predominantly negative but
      can be made positive by duplicating the coefficients on a new card
      with reversed signs, using an ordinary key punch. Three cards.
      1) Executive card.
         1-5 "/GIST"
      2) Component cards.
         Card A. Call output file package.
            1-6 "RFACTI"
         Card B. Blank.
   l. SPAN2 program: provides the spherical configuration of the
      variables on the varimax dimensions. Two cards.
      1) Executive card.
         1-6 "/SPAN2"
      2) Component card.
         5-8 "2"
   m. GIVE restart card. One card.
      1) Executive card.
         1-5 "/GIVE"
   n. END of job. One card.
      1) Executive card.
         1-4 "/END"
   o. Local card.


NOTE 1: If the SPAN configuration is distorted because some
dimensions are preponderantly negative in the signs of their factor
coefficients, reflect the factor coefficients as indicated under GIST above
and input by GIST the new RFACTI file package.

NOTE 2: It is always possible to get any desired sphere by calling
for it.

NOTE 3: From the SPAN configuration choose the defining variables
of the K clusters that are most nearly independent and meaningful and
run a preset CC5 analysis. The CSA2 and SPAN components provide an
oblique solution to these varimax-defined clusters.


Condensation method of O-analysis (OTYPE analysis) 


The condensation method is an iterated condensation of the NS individuals
around a small number M of core O-types. The full sequence follows:


a through d: local, START, COMMENT, DAP2. Same as empirical run.

e. TAKE. Identical with preset key-cluster run but using the GIVE deck
output by that run. Same as preset run.

f. Cluster scoring program, FACS: computes simple sum Z scores
(mean = 50, standard deviation = 10) on each cluster whose definers
are selected in the preset key-cluster analysis or in the initial empirical
analysis if that is considered satisfactory. Four cards.

1) Executive card.
1-5 "/FACS"
2) Component cards.
Card A. Blank. If cluster scores are wanted on an output deck (the
FSCORI deck), punch "1" in column 12.
Card B. Blank.
Card C. Punch in columns 1-72 an identifying title of this FACS run.


NOTE: The defining variables of the clusters can be preset differently
from those in the REFLX1 file of the GIVE deck by GIST input of a
REFLX1 file package inserted after TAKE.
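
The scaling that FACS applies (mean = 50, standard deviation = 10) can be illustrated in modern terms. What follows is a minimal sketch and not the FACS routine itself: the function name, the use of unweighted raw sums over the definers, and the choice of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def simple_sum_z_scores(raw, definers):
    """Score one cluster for every individual on a 50/10 scale.

    raw      -- list of score rows, one row per individual (hypothetical layout)
    definers -- column indexes of the cluster's defining variables
    """
    # Assumption: the "simple sum" is the unweighted sum of the raw definer scores.
    sums = [sum(row[j] for j in definers) for row in raw]
    m, s = mean(sums), pstdev(sums)   # assumption: population standard deviation
    if s == 0:
        raise ValueError("all individuals have the same sum; scores cannot be scaled")
    return [50 + 10 * (x - m) / s for x in sums]

# Three individuals, one cluster defined by the first two variables.
scores = simple_sum_z_scores([[1, 2, 9], [3, 4, 9], [5, 6, 9]], definers=[0, 1])
```

By construction the returned scores have mean 50 and standard deviation 10, so an individual at the group average on the summed definers scores exactly 50.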


g. OTYPE program: assigns all the NS individuals to arbitrary sectors in
the cluster score space of K dimensions; selects as core types those
in sectors which include at least 2 percent of the group; reassigns all
NS individuals to those core types; iterates the process to a final
solution consisting of M O-types plus a set of unique individuals; and
computes the hierarchical order of the M O-types. Two cards.

1) Executive card.
1-6 "/OTYPE"
2) Component card.
Card A. Blank.
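
The condensation cycle described for OTYPE can be sketched roughly as follows. This is only an illustration under stated assumptions, not the OTYPE program: the sectoring rule (each Z score coded low, middle, or high at cut points 45 and 55), the use of Euclidean distance for reassignment, and the `radius` beyond which an individual is treated as unique are all hypothetical choices, and the hierarchical ordering of the O-types is omitted.

```python
from math import dist
from statistics import mean

def sector(profile, low=45.0, high=55.0):
    """Code each Z score as 0 (low), 1 (middle) or 2 (high): a hypothetical sectoring."""
    return tuple(0 if z < low else 2 if z > high else 1 for z in profile)

def centroid(rows):
    """Mean profile of a group of individuals."""
    return tuple(mean(col) for col in zip(*rows))

def condense(profiles, min_share=0.02, radius=15.0, max_iter=20):
    """Return (core_profiles, assignment); assignment[i] is a core index or None."""
    by_sector = {}
    for p in profiles:
        by_sector.setdefault(sector(p), []).append(p)
    # Core types: sectors holding at least min_share (2 percent) of the group.
    cores = [centroid(rows) for rows in by_sector.values()
             if len(rows) >= min_share * len(profiles)]
    if not cores:
        raise ValueError("no sector reaches the minimum share")
    assignment = []
    for _ in range(max_iter):
        assignment = []
        for p in profiles:
            d, k = min((dist(p, c), k) for k, c in enumerate(cores))
            # Assumption: individuals farther than `radius` from every core are unique.
            assignment.append(k if d <= radius else None)
        updated = []
        for k, c in enumerate(cores):
            rows = [p for p, a in zip(profiles, assignment) if a == k]
            updated.append(centroid(rows) if rows else c)
        if updated == cores:          # no core moved: the solution has settled
            break
        cores = updated
    return cores, assignment

# Two dense groups and one stray individual in a K = 2 cluster score space.
profiles = [(60, 60)] * 30 + [(40, 40)] * 30 + [(90, 10)]
cores, assignment = condense(profiles)
```

In this toy run the stray individual occupies a sector holding less than 2 percent of the group, so it never founds a core type and ends up flagged as unique.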


h. OSTAT program: lists the K cluster scores of the members of each
of the M O-types produced by the last iteration of program OTYPE;
for each O-type, computes its means, sigmas, and homogeneities of
scores on each dimension; and calculates the correlation ratio η (eta) of each
dimension on the discontinuous series of M O-types. Two cards.
1) Executive card.
1-6 "/OSTAT"
2) Component card.
1-4 "2"
13-16 "1", which outputs OMEANI cards of the M O-types for the
convergence method below.
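
For readers unfamiliar with the correlation ratio, the standard definition is the square root of the between-group sum of squares divided by the total sum of squares. The sketch below computes it for one dimension; it assumes this textbook definition is the one OSTAT uses, and it leaves open how the unique individuals are treated. The function name and data layout are illustrative.

```python
from math import sqrt
from statistics import mean

def correlation_ratio(groups):
    """Eta for one score dimension; `groups` holds one list of scores per O-type."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    if ss_total == 0:
        raise ValueError("scores do not vary; eta is undefined")
    return sqrt(ss_between / ss_total)

# O-types that differ only in level reproduce the dimension perfectly.
eta_high = correlation_ratio([[40, 40], [60, 60]])
# O-types with identical means carry no information about the dimension.
eta_zero = correlation_ratio([[40, 60], [40, 60]])
```

An eta near 1 means that membership in an O-type almost fully determines the score on that dimension; an eta near 0 means the O-types do not discriminate on it.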

i. GIVE, the restart card: reestablishes the IST files. To do so is important
for two reasons: (1) by inputting this GIVE deck (via TAKE) and calling
OTYPE again, if more iterations are desired after looking at the results,
they can be continued from where they left off at the end of the last
OTYPE run, and (2) if one wishes to perform typological prediction (see
below), this GIVE deck reestablishes the OTYPEI file after the last
iteration, a necessary input file for the use of the 4CAST program in
prediction (see below). One card.

1) Executive card.
1-5 "/GIVE"

j. END card. One card.
1) Executive card.
1-4 "/END"

k. Local card.


Convergence method (EUCO analysis) of O-analysis 


The convergence method is called EUCO analysis. As a general method of
projecting all individuals into a single spherical configuration from which a
core set of M O-types is selected as descriptive of the O-cluster structure
in the full supply of subjects, EUCO analysis is expensive in computer time
and lacks the objectivity of the condensation method, which now supersedes
it as the means of discovering the typology of a group.

But EUCO analysis is nevertheless critically important as a means
of revealing the spherical configuration of the O-type structure. Hence,
it is now included as an essential step following the description of the
typology by the condensation method. Only the M O-types are carried
into EUCO analysis, these being abstract individuals whose score profiles
on the K cluster dimensions are output as the OMEANI file from the
OSTAT run in the section above. In addition, hypothetical marker
"individuals" are also projected into the configuration, by means of which
the Z score axes are represented in the configuration so that it becomes
possible to read off approximately the actual score patterns of the different
O-types. The model markers are called the "OMARK objects," whose
score profiles on the K dimensions are also output by the OSTAT program
for use in the procedure described below. The full procedures follow:


a to c: Local, START, and COMMENT: same as empirical run.

d. DAP2 program: used here to set up a problem title (IDFILE) and to
provide the O-names (ONAMS1 file) of the O-types and OMARK objects
that are projected into the spherical configuration. Three or more cards
plus data deck.

1) Executive card.
1-5 "/DAP2"

2) Component cards.

Card A. Title card.
1-72 Title of problem to be printed on output.

Card B. Parameters. Pack right.
1-4 Punch K, the number of Z score dimensions on the input data
cards, below.
5-8 Punch NS, the number of objects in this particular EUCO
analysis, these being the M O-types whose OMEANI cards are in
the data deck plus the OMARK objects whose scores are also
output by OSTAT. (Of course, any other kinds of cases may be
added.)

Card C. Insert a format statement here. OMEANI punch output
is in F8.4 format (see below).

Card D. Variable names: in this case punch the names that have
been assigned to the K dimensions. See the V-names
procedure described above.

Cards E, F, G. Usually not required.

Data cards. This is the data deck of OMEANI and OMARK cards,
input thus:

Card 1. Initiating input.
1-4 "DATA"

Data deck is inserted here.

Card 2. Terminating input.
73-75 "END"

e. DPRINT program: provides a listing of the input cards. Same as above.
Spot check this listing. One card.

f. GIST input of IST files needed by program EUCO later.

1) Executive card. One card.
1-5 "/GIST"

2) Component cards for MEANS2. Two cards.
Card A. Input file package for MEANS2.
1-6 "MEANS2"
9-12 "1"
Card B. Value of MEANS2.
3-4 "50"

3) Component cards for STDEV2. Three cards.
Card A. Input file package for STDEV2.
1-6 "STDEV2"
9-12 "1"
Card B. Number of dimensions (factors) K.
1-4 Punch K, that is, the value of K.
Card C. Values of STDEV2 on the K dimensions.
3-4, 11-12, 19-20, 27-28, etc., for K such fields, punch "10" in each
field.

4) Component cards for FSCORI. Two or more cards plus NS cards
of data deck.
Card A.
1-6 "FSCORI"
9-12 "1"
13-16 Blank for standard format, i.e., successive eight-column fields
with decimal point at fourth place from the right.
"1" for nonstandard format. Format statement, card B, is
required.
Card B. Necessary only when column 16 of card A is punched "1".
Punch format statement with open and close parentheses
in columns 1 through 72. No "INFORMAT" is required as in
DAP2.
Card C. Number of column variables and row objects on data cards.
Pack right.
1-4 Punch K, that is, the value of K.
5-8 Punch NS, the number of subjects in this particular
EUCO analysis.
Data deck of FSCORI cards.

5) A final blank card is mandatory. One card.

g. EUCO program: for COR2 below, computes the D values of each of
the NS objects taken as column comparison entities with a common
set of row referent objects that are a special set of OMARK objects
that spans the K-dimensional score space; also computes D values
between the NS objects. Two cards.
1) Executive card.
1-6 "/EUCO2"
2) Component card.
17-20 "1", which sets up the D matrix with the common row referents
as OMARK objects.
h. Correlation (COR2) program: computes the rpp values between the
NS objects where the row referents are the OMARK objects that span
the score space. One card.

1) Executive card. 
1-5 "/COR2" 

i. Diagonal values (DVP) program: inserts unities in the diagonal cells of 
the correlation matrix. Two cards. 
1) Executive card. 

1-4 "/DVP" 
2) Component card. 
3-4 "71" 

j to o: CC5, CSA2, SPAN2, GIVE, END, and local. Same as h to m of the
empirical run. The spherical configuration of the objects with the
OMARK score axes projected into the configuration is given on the
SPAN diagrams.
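
The chain from EUCO through COR2 and DVP in steps g to i can be pictured with a small sketch. It assumes that a D value is the ordinary Euclidean distance between two score profiles, which the text does not state outright, and the marker and object profiles are invented for the example; none of the function names come from the programs themselves.

```python
from math import dist, sqrt
from statistics import mean

def d_matrix(objects, markers):
    """Rows are referent markers, columns are objects; entries are Euclidean distances."""
    return [[dist(m, o) for o in objects] for m in markers]

def pearson(x, y):
    """Product-moment correlation of two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)   # fails if a column of distances is constant

def object_correlations(d):
    """Correlate the objects (columns of d) over the common row referents."""
    cols = list(zip(*d))
    n = len(cols)
    # Unities are placed in the diagonal cells, as the DVP step does.
    return [[1.0 if i == j else pearson(cols[i], cols[j]) for j in range(n)]
            for i in range(n)]

markers = [(30, 50), (70, 50), (50, 30), (50, 70)]   # hypothetical OMARK profiles, K = 2
objects = [(65, 50), (65, 50), (35, 50)]             # hypothetical O-type mean profiles
r = object_correlations(d_matrix(objects, markers))
```

Objects with the same profile have identical distance patterns to the markers and so correlate perfectly, while objects on opposite sides of the score space correlate negatively; it is this correlation matrix that the later CC5 and SPAN steps turn into a spherical configuration.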


NOTE: If one wants to converge onto the configuration of the full 
supply of objects by running different subsamples of individual objects 
on the same sequence of steps above, it can be done by superimposing 
their SPAN plots provided the CC5 analysis is preset on a common set of 
definer objects and dimensions, these being the set used in the CC5 
analysis above. 


Typological prediction 


In the condensation method, program OTYPE describes the M O-types in
an IST file called OTYPEI. Typological prediction by program 4CAST
reveals for each of the M O-types described in OTYPEI its prediction of
scores on one or more variables. The sequence in typological prediction
is as follows:


a to d: Local cards, START, COMMENT, DAP2. Same as in empirical run.
DAP2 is not necessary if the predicted variables are taken from the
FSCORI file on IST.

e. TAKE program: reestablishes the OTYPEI file from the GIVE deck
output by the condensation method. (If OTYPEI comes from OSTAT,
the GIVE deck used should, of course, have been output by OSTAT.)

f. 4CAST (forecast of predicted variables from the O-types) program: for
each of the M O-types, computes the mean, sigma, and homogeneity
of scores on a predicted variable; deposits the NS scores of the full
supply in the core of the computer and draws many random samples
having the same number of cases as in the O-type and computes
the mean, sigma, and homogeneity of each sample; prints the
distribution of the sample values, spotting the observed value in the
distribution; and states the probability of recovering sample values
at least as deviant as the observed, for different levels of confidence.
Two or more cards.
1) Executive card.
1-6 "/4CAST"
2) Component cards.
Card A. Parameters. Pack right.
1-4 Blank means predicted variables are from IST file FSCORI;
punch "1" if they are to come from DST via DAP2.
5-8 Blank takes all the variables in the file; punch "1" to
select only some variables for 4CAST analysis via detail
card B below.
9-12 Blank draws 300 random samples; punch the number
wanted if not 300.
13-16 Blank lists the scores of the variables; punch "1" to
suppress list.
17-20 Blank plots distribution of sample means; punch "0" to
suppress it.
29-32 Blank leaves raw scores as is; punch "1" to convert them
to Z scores.
33-36 Blank means complete data; "1" must be punched if any
scores are missing.
Card B. Detail card used if variables are selected by a "1" punch
in columns 5-8 above. Punch the ordinal numbers of the
selected variables (from DAP2) or clusters (from FSCORI
on IST file) in free field integer format (commas between
integers and a period at the end).
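
The random-sampling logic that 4CAST applies to an O-type mean can be sketched as below. This is an illustration under assumptions rather than the program's actual procedure: sampling is done without replacement, deviance is measured as absolute distance from the mean of the full supply, and only the mean is treated (sigma and homogeneity are left out). The default of 300 samples follows the card description above; the function and parameter names are hypothetical.

```python
import random
from statistics import mean

def forecast_mean(all_scores, otype_scores, n_samples=300, seed=0):
    """Compare an O-type's mean on a predicted variable with random-sample means.

    Returns (observed_mean, sample_means, proportion_at_least_as_deviant).
    """
    rng = random.Random(seed)               # fixed seed so the sketch is repeatable
    grand = mean(all_scores)
    observed = mean(otype_scores)
    size = len(otype_scores)                # samples match the O-type in size
    # Assumption: each sample is drawn without replacement from the full supply.
    sample_means = [mean(rng.sample(all_scores, size)) for _ in range(n_samples)]
    as_deviant = sum(abs(m - grand) >= abs(observed - grand) for m in sample_means)
    return observed, sample_means, as_deviant / n_samples

full_supply = list(range(100))              # invented scores for NS = 100 individuals
observed, sample_means, p = forecast_mean(full_supply, [95, 96, 97, 98, 99])
```

A small proportion indicates that random groups of the same size rarely deviate as far from the general mean as the O-type does, which is the sense in which the O-type "predicts" the variable.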


g. END of job card. One card. 
1) Executive card. 


1-4 "/END" 


h. Local. 


Format statement in DAP2 and GIST 


DAP2: Card C.
1-8 "INFORMAT"
9-80 In columns 9-80, punch F, X, and A symbols describing the input
format, thus:

( (open parenthesis): the beginning of the format statement.
) (close parenthesis): the end of the format statement.
, a comma, separating each of the F, X, or A statements.
F the number of data fields.
X the number of columns to be skipped.
A the number of columns where O-names are punched.
/ skip the rest of the columns and go to next card.

For example,
9-80 "(1F1.0,11X,4F3.2,48X,A6/30X,5F8.4//20X,F8.3)"
means that there are four data cards per subject, as follows:

Card 1:
The first field is read from column 1 with the decimal point to the right
of the column.
Skip the next 11 columns.
Read four fields from the next 12 columns with three columns per field
and with decimal points at the second places from the right of all four
fields.
Skip the next 48 columns.
Read the next six columns as O-names. (The ONAMS1 file will not be set
up unless this "A6" statement is punched for this first card.)
Go to card 2 (skipping the last two columns of card 1).
Card 2:
Skip the first 30 columns.
Read five fields from the next 40 columns with the decimal points at the
fourth places from the right of all five fields.
Go to card 3 (skipping the remaining columns of card 2).
Card 3:
Skip whole card and go to card 4.
Card 4:
Skip the first 20 columns.
Read the next eight columns as a field with decimal point at third place
from the right of the field.
Data fields end.

GIST: Format statement in GIST is the same as in DAP2 except that no
"INFORMAT" is necessary. Start "(", open parenthesis, from
column 1.
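
The way such a format statement carves up the cards can be checked with a short reader. The sketch below handles only the F, X, A, comma, and slash symbols described above and is not a full FORTRAN format interpreter; treating a blank field as zero and letting a punched decimal point override the implied one are assumptions, and the card contents are invented.

```python
import re

def read_subject(fmt, cards):
    """Return (o_name, values) read from one subject's cards under a simple F/X/A format."""
    body = fmt.strip()[1:-1]                         # drop the enclosing parentheses
    name, values = None, []
    for card, spec in zip(cards, body.split("/")):   # "/" moves on to the next card
        col = 0
        for item in (s.strip() for s in spec.split(",")):
            if not item:
                continue                             # empty spec: the whole card is skipped
            rep, kind, rest = re.fullmatch(r"(\d*)([FXA])([\d.]*)", item).groups()
            if kind == "X":
                col += int(rep or 1)                 # skip columns
            elif kind == "A":
                name = card[col:col + int(rest)]     # O-name field
                col += int(rest)
            else:
                width, decimals = (int(part) for part in rest.split("."))
                for _ in range(int(rep or 1)):
                    text = card[col:col + width].strip()
                    col += width
                    if not text:
                        values.append(0.0)           # assumption: blank reads as zero
                    elif "." in text:
                        values.append(float(text))   # assumption: punched point overrides
                    else:
                        values.append(int(text) / 10 ** decimals)
    return name, values

fmt = "(1F1.0,11X,4F3.2,48X,A6/30X,5F8.4//20X,F8.3)"
cards = [
    "7" + " " * 11 + "123456789012" + " " * 48 + "SUBJ01" + "  ",
    " " * 30 + "".join(f"{v:8.4f}" for v in (50.1234, 49.0, 51.5, 48.25, 50.0)) + " " * 10,
    " " * 80,                                        # card 3 is skipped entirely
    " " * 20 + "    1500" + " " * 52,
]
name, values = read_subject(fmt, cards)
```

With these four invented cards the reader returns the O-name from columns 73-78 of card 1 and eleven numeric values: one from column 1, four from columns 13-24, five from card 2, and one from card 4.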


Presetting defining variables of composites 
There are three types of composites for which one may wish to preset the 
defining variables: 


a. Preset collinear clusters for CC5-CSA2-SPAN2-FACS.
The definers are preset in CC5 and are thereafter carried in the
REFLX1 file for CSA2, SPAN2, and FACS.
b. Composites for CSA2 and FACS.
The definers are preset by the GIST (REFLX1) input package. Six or
more cards.
1) Executive card.
1-5 "/GIST"
2) Component cards.
Card A. File.
1-6 "REFLX1"
9-12 "1" for input file package.
Card B. Number of dimensions (clusters).
1-4 Punch K, the number of dimensions. Pack right.
Card C. Numbers of definers of the clusters.
1-4, 5-8, 9-12, etc. The numbers of definers in each of the clusters
are punched in successive four-column fields. Pack right.
Card D. (Deck).
Card D1. Ordinal numbers (indexes) of the defining variables of
cluster 1.
1-4, 5-8, 9-12, etc. Punch the indexes of the definers of cluster 1.
Card D2, D3, etc., up to DK. Same as Card D1, but for clusters 2, 3,
etc., up to cluster K.
Card E. Blank, if no other GIST cards follow. Mandatory.

c. Composites for FACS only.
Defining variables of clusters can be directly specified in the control
cards of FACS, and furthermore, there can be up to NV definers of a
single cluster. Specification is by inputting the "weight matrix." See
any printout of FACS for the format of the weight matrix. A card in
decimal card format, i.e., (9F8.4), corresponds to each column of the
printed weight matrix.
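
The card layout of the GIST (REFLX1) input package in item b lends itself to a small generator of 80-column card images. The sketch below is illustrative only: the helper names are hypothetical, the "1" in columns 9-12 of card A is assumed to be packed right like the other integer fields, and no continuation rule is attempted for a cluster with more than 20 definers, since none is given here.

```python
def pack_right(values, width=4):
    """Right-justify each integer in its own fixed-width field ("pack right")."""
    return "".join(str(v).rjust(width) for v in values)

def reflx1_cards(clusters):
    """Build the card images for a GIST (REFLX1) input package.

    clusters -- one list of definer indexes per cluster, e.g. [[1, 4, 7], [2, 5]]
    """
    if any(len(c) > 20 for c in clusters):
        raise ValueError("more than 20 definers will not fit on one 80-column card")
    cards = ["/GIST"]                                   # executive card
    cards.append("REFLX1".ljust(8) + pack_right([1]))   # card A: file name, input flag
    cards.append(pack_right([len(clusters)]))           # card B: K
    cards.append(pack_right([len(c) for c in clusters]))  # card C: definers per cluster
    cards.extend(pack_right(c) for c in clusters)       # cards D1 ... DK
    cards.append("")                                    # card E: mandatory blank card
    return [c.ljust(80) for c in cards]

deck = reflx1_cards([[1, 4, 7], [2, 5]])
for card in deck:
    print(card.rstrip())
```

For K clusters the deck holds K + 5 cards, which agrees with the "six or more cards" stated for this package when K is at least 1.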

